What is inside

Objective

Gathering Data

Prepare Data for Modeling

- Supervised/Unsupervised/Regression/Classfication
- Data Visualization
- Data Cleaning: Missing Values, Outlier Removal
- Feature Extraction: Interaction, Lagging, One-Hot Encoding
- Feature Selection: Univariate/Recursive
- Split data into train/test
- Under Sampling, Over Sampling
- Scaling, Normalizing, outlier removal

Baseline Modeling

- Select model
    - Linear Regression, Logistic Regression
    - K-nearest neighbors, Decision Tree
    - Support Vector Machine, Random Forest
- Parameteric Models: Multicolinearity, Relation with the outcome
- Evaluation Metric: ROC AUC, PR AUC, Accuracy, R2, MSE, RMSE, RSS, MAE
- Fit model on train-set
- Test model on test-set
- Feature importances, ANOVA table(stats_models), Coefficents

Secondary Modeling

- Reduce Overfitting
    - Hyperparameter Tuning: L1, L2 penality
- Gradient Boosting, XGBoost, Custom Ensembles
- Improve generalization error
In [2]:
import pandas as pd
import pickle
from geopy.distance import great_circle
import operator
from scipy import sparse
from scipy.sparse import csr_matrix, vstack, hstack, coo_matrix
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%matplotlib inline

1. Objective

Predict whether a restaurant will be closed or not based on its geographical location and user reviews. Most businesses don't just open and close in short period of time. They stay for years before closing. The newest Yelp Dataset was released before 6 months. It contains data about businesses that were opened and closed since 2004. Thus, it will be useful to predict whether a restaurant will close in the next few years. Banks can use this model to make informed loan decisions. Because loans are mostly long-term. Investors want to put their money on a restaurant that is likely to remain for many years.

I searched for related works that try to predict restaurants closure, The project I am aiming to develop is unique in the following aspects:

  • It can better generalize across cities because it is trained on 11 meteropolitan areas. The related projects were focused on one city.

  • It aims at improving perfomance by combining bag of words with other features, like the surrounding performance. Researchers at Univeristy of Maryland worked on the textual features only and Michail Alifierakis used features related to the surrounding performance only. He didn't encoporate textual features.

  • It uses the most recent version of the dataset which was released before 6 months.

2. Gathering Data

This dataset is a subset of Yelp's businesses, reviews, and user data. In total, there are :

  • 5,200,000 user reviews
  • Information on 174,000 businesses
  • The data spans 11 metropolitan areas in 4 countries

It is a huge dataset. I beleive that more data contributes in imporivng model generalization.Loading the data to memory at once is impossible. I considered the following options

  • a machine learning library other than pandas
  • dividing the data into chunks and working one chunk at a time
  • database
  • cloud

Pandas can read in chunks.I utilized Google Cloud Platform to create a virtual machine with 8 CPUs and 56 GB RAM. The project involved trying models on sets of features, grid search acorss pipelines and processing textual data. The cloud compute engine really helped me see results quickly.

In [3]:
b = pd.read_json("yelp_academic_dataset_business.json", lines=True)

Business

Contains business data including location data, attributes, and categories.

In [4]:
b.head(6)
Out[4]:
address attributes business_id categories city hours is_open latitude longitude name neighborhood postal_code review_count stars state
0 1314 44 Avenue NE {'BikeParking': 'False', 'BusinessAcceptsCredi... Apn5Q_b6Nz61Tq4XzPdf9A Tours, Breweries, Pizza, Restaurants, Food, Ho... Calgary {'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'... 1 51.091813 -114.031675 Minhas Micro Brewery T2E 6L6 24 4.0 AB
1 {'Alcohol': 'none', 'BikeParking': 'False', 'B... AjEbIBw6ZFfln7ePHha9PA Chicken Wings, Burgers, Caterers, Street Vendo... Henderson {'Friday': '17:0-23:0', 'Saturday': '17:0-23:0... 0 35.960734 -114.939821 CK'S BBQ & Catering 89002 3 4.5 NV
2 1335 rue Beaubien E {'Alcohol': 'beer_and_wine', 'Ambience': '{'ro... O8S5hYJ1SMc8fA4QBtVujA Breakfast & Brunch, Restaurants, French, Sandw... Montréal {'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'... 0 45.540503 -73.599300 La Bastringue Rosemont-La Petite-Patrie H2G 1K7 5 4.0 QC
3 211 W Monroe St None bFzdJJ3wp3PZssNEsyU23g Insurance, Financial Services Phoenix None 1 33.449999 -112.076979 Geico Insurance 85003 8 1.5 AZ
4 2005 Alyth Place SE {'BusinessAcceptsCreditCards': 'True'} 8USyCYqpScwiNEb58Bt6CA Home & Garden, Nurseries & Gardening, Shopping... Calgary {'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ... 1 51.035591 -114.027366 Action Engine T2H 0N5 4 2.0 AB
5 20235 N Cave Creek Rd, Ste 1115 {'BikeParking': 'True', 'BusinessAcceptsCredit... 45bWSZtniwPRiqlivpS8Og Coffee & Tea, Food Phoenix {'Monday': '5:30-20:0', 'Tuesday': '5:30-20:0'... 1 33.671375 -112.030017 The Coffee Bean & Tea Leaf 85024 63 4.0 AZ

See Categories more closely

In [5]:
b.categories.sample(5)
Out[5]:
143505            Tanning, Beauty & Spas, Day Spas, Massage
111884                             Restaurants, Steakhouses
186651                                 Italian, Restaurants
103323                    Automotive, Auto Parts & Supplies
153653    Health & Medical, Massage, Day Spas, Reflexolo...
Name: categories, dtype: object

The Yelp Category List

  • Active Life
  • Arts & Entertainment
  • Automotive
  • Beauty & Spas
  • Education
  • Event Planning & Services
  • Financial Services
  • Food
  • Health & Medical
  • Home Services
  • Hotels & Travel
  • Local Flavor
  • Local Services
  • Mass Media
  • Nightlife
  • Pets
  • Professional Services
  • Public Services & Government
  • Real estate
  • Religious Organizations
  • Restaurants
  • Shopping

Each one of these contain other specific categories

Restaurants

  • Afghan
  • African
    • Senegalese
    • South African
  • American (New)
  • American (Traditional)
  • Arabian
  • Argentine
  • Armenian

Attributes are stored as a nested JSON object.

In [6]:
b.attributes[1005]
Out[6]:
{'BikeParking': 'True',
 'BusinessAcceptsCreditCards': 'True',
 'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
 'ByAppointmentOnly': 'True',
 'GoodForKids': 'True',
 'HairSpecializesIn': "{'coloring': True, 'africanamerican': False, 'curly': True, 'perms': True, 'kids': True, 'extensions': True, 'asian': False, 'straightperms': True}",
 'RestaurantsPriceRange2': '2',
 'WheelchairAccessible': 'True'}
In [7]:
b.columns
Out[7]:
Index(['address', 'attributes', 'business_id', 'categories', 'city', 'hours',
       'is_open', 'latitude', 'longitude', 'name', 'neighborhood',
       'postal_code', 'review_count', 'stars', 'state'],
      dtype='object')

All the attributes are displayed as follows. Star rating average of all stars recieved by the business, rounded to half-stars.

Check in

Checkins on a business.

In [8]:
csize = 1e2
checkin_reader = pd.read_json("yelp_academic_dataset_checkin.json", lines=True, chunksize=csize)
for i, ch in enumerate(checkin_reader):
    if(i > 0):
        break
In [9]:
ch.columns
Out[9]:
Index(['business_id', 'time'], dtype='object')
In [10]:
ch.head()
Out[10]:
business_id time
100 6FkCrxJMq5KIgavHRqtlbQ {'Tue-4': 1, 'Wed-14': 2, 'Thu-15': 1, 'Sun-16...
101 d24LIT55-aAr7l1X99vpjw {'Fri-1': 1, 'Fri-17': 1, 'Mon-19': 1, 'Tue-22...
102 2pmOI_mrn1ZWB88IeiOMkA {'Tue-16': 1, 'Tue-17': 1, 'Wed-17': 1, 'Wed-2...
103 KvUuXo2Fpt-rcWR13iLWHg {'Wed-0': 1, 'Sun-1': 1, 'Sat-13': 1, 'Mon-16'...
104 quYlYKkiZ5qRLdu-_hew7g {'Tue-17': 1, 'Sun-19': 1}

Review

Contains full review text data including the user_id that wrote the review and the business_id the review is written for.

The reviews data is more than 5 GB, uploading it directly took time. I decided to upload the data to an ftp server and transfer it from there. Python has a builtin libraries: ftplib and requests, which can read data from ftp server.

In [11]:
# # import neccessary libraries
# from ftplib import FTP
# import requests

# # login to ftp server
# server = "70.40.217.80"
# username = "tinsaecares@tinsaealemayehu.com"
# password = "[RQ-K$%[1m;N"
# ftp = FTP(server)
# ftp.login(user=username, passwd=password)


# rvfile = open("yelp_academic_dataset_tip.json", "wb")
# ftp.retrbinary("yelp_academic_dataset_tip.json", rvfile.write)
# rvfile.close()
In [12]:
# # Downloading directly from yelp to computer
# !wget -c "https://storage.googleapis.com/kaggle-datasets/10100/277695/yelp-dataset.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1551046405&Signature=p%2BBda9sOLceu8EceqHAhrOPKgraMDDfZ8%2FqiW1z8DygzgoJaUzBW5xL8MWtD5z55OVJWe61Zv9qpdVeHCXcZ0r4mpV7dezhwA2Mmus2OyJqAhhWoXxPSieNB36fHaFmrm6xu%2FOEp5R1TJ2s72vWKU7tnOJS9%2BFBYnvOuV6cPN4lrpVVTrixOjCLv8QQWndfGR7V1Wcz2MqLaqAVFWuN94WU6ycz%2BlcSvdgqPplNHmmzcT%2BBvepxcuTheWMPuHusGUPJ1FHHQ4BEv8RgrhtTvXtV6jxupyw69eMHWpdN0J4bWHACc2%2BGthxgK00Tplt%2FrINAFiPATMf7dpciyNKigAA%3D%3D" -O yelp.zip
In [13]:
csize = 1e2
review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)
for i, r in enumerate(review_reader):
    if(i > 0):
        break
In [14]:
r.columns
Out[14]:
Index(['business_id', 'cool', 'date', 'funny', 'review_id', 'stars', 'text',
       'useful', 'user_id'],
      dtype='object')
In [15]:
r.head()
Out[15]:
business_id cool date funny review_id stars text useful user_id
100 7wHLFohwCw8l6WS-feLjeg 1 2017-04-05 0 5k9F53Lanw09xR3nwCdRYg 5 I had an amazing time here. They were very bus... 1 Yy_iGXxLpL6tYDQoE-6XVg
101 ixAh9crILnJ9tM8LhWFhkw 0 2017-04-05 0 Fv1cqziL1JShSztJWYkTcA 4 I've been coming here for a while and I love h... 0 Yy_iGXxLpL6tYDQoE-6XVg
102 aQ222ydz_GSRZV66xNt4kQ 0 2017-04-09 1 AxhxGl41DItKjqkzYtr53Q 1 2 of my girlfriends recommended this place so ... 1 Yy_iGXxLpL6tYDQoE-6XVg
103 owLBKLyHe85xPba24bmZRw 0 2016-01-23 0 lAyATpsxALW9owlSKhSozA 5 Vickie makes me feel so comfortable every time... 0 Yy_iGXxLpL6tYDQoE-6XVg
104 s3i73_ttk_F33AEdqMr18g 0 2017-01-31 0 _qIXejLYgWGYYtcGcMw1eg 3 The halo-halo was good! I got the mango toast ... 0 Yy_iGXxLpL6tYDQoE-6XVg

See the text, stars, date and user. A review can receive votes from other users: useful, cool, funny. The business that is reviewed, the user who reviewed are linked by business_id and review_id respectively.

Tip

Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.

In [16]:
csize=1e2
tip_reader = pd.read_json("yelp_academic_dataset_tip.json", lines=True, chunksize=csize)
for i, t in enumerate(tip_reader):
    if(i > 0):
        break
In [17]:
t.columns
Out[17]:
Index(['business_id', 'compliment_count', 'date', 'text', 'user_id'], dtype='object')
In [18]:
t.head()
Out[18]:
business_id compliment_count date text user_id
100 LsyW2kRAP0HYYP0Qov4Kng 0 2014-09-12 22:30:50 Free naan if you stay and eat Rep7MccZrj-iHLS_hBHr1A
101 mdUnUYQoXTouoziuLzaR0g 0 2012-03-21 12:43:01 Ask for Sharon if you're getting a pedicure! BceBAi4nnMWiUsKALGMg3A
102 R9cM1tmBkQ03lnbE_a-U4Q 0 2016-12-09 17:51:53 Fast service. Cute staff. Great value. qPganFH_fTXrP9r3kXf6YQ
103 dSEpkUP_RuR5gG_LhJ_INA 0 2012-03-23 05:50:31 Well lit for a nice evening walk k2boyvGyjpQ2PhWVCNM3CQ
104 THO77IL6DLob9Agt9QCjsw 0 2018-03-22 04:56:17 Still going to Anna! Schedule online! ouk36OGbx25nO23b10L5jw

User

User data including the user's friend mapping and all the metadata associated with the user.

In [19]:
user_reader = pd.read_json("yelp_academic_dataset_user.json", lines=True, chunksize=csize)
csize=1e2
for i, u in enumerate(user_reader):
    if(i > 0):
        break
In [20]:
u.columns
Out[20]:
Index(['average_stars', 'compliment_cool', 'compliment_cute',
       'compliment_funny', 'compliment_hot', 'compliment_list',
       'compliment_more', 'compliment_note', 'compliment_photos',
       'compliment_plain', 'compliment_profile', 'compliment_writer', 'cool',
       'elite', 'fans', 'friends', 'funny', 'name', 'review_count', 'useful',
       'user_id', 'yelping_since'],
      dtype='object')
In [21]:
u.head()
Out[21]:
average_stars compliment_cool compliment_cute compliment_funny compliment_hot compliment_list compliment_more compliment_note compliment_photos compliment_plain ... cool elite fans friends funny name review_count useful user_id yelping_since
100 3.00 0 0 0 0 0 0 0 0 0 ... 0 0 k7QixRwahZavaQ_Rhd9s_w, _gLR-74C8aY-lIP_u3vFow... 1 Carlos 4 4 B2CkkEX341HLK3zbn3qcgQ 2014-06-22 06:54:32
101 3.73 3441 8 3441 2097 22 573 815 2468 2194 ... 70825 2016,2017,2018 379 -FZBTkAZEXoP7CYvRV2ZwQ, yDgBhl1QH-z7UMb6i98SOA... 49550 Alan 1379 78486 fRJpK_b0rrjpBgRZjvfvgA 2016-07-24 00:26:52
102 5.00 0 0 0 0 0 0 0 0 0 ... 1 2 w23fjhPXptVBdRyLkcbnPQ, AOfBV9hAzNXYFUzlkHxlAg... 0 Lisa 5 2 ethmTQIejtNt715NYme53g 2015-07-22 18:34:50
103 3.21 0 0 0 0 0 0 2 0 6 ... 28 0 hRbdn-DZAuqYv88bezKrZQ, q3IYh428CVOkU_W90sM3cg... 44 Flynn 70 127 xZAmw5gihOVO4duMN2Ju6Q 2013-03-25 20:22:45
104 3.56 14 5 14 28 0 0 12 17 30 ... 469 2012,2013,2014,2015,2016,2017,2018 54 uctq9SbdzSWBKJ92FO_foA, IcktNezCseEpno2NgRfZ_g... 430 Gigi 436 897 QFrfQw7FH96Vk6PFxxTLcA 2009-08-11 23:16:30

5 rows × 22 columns

useful, funny, cool exisit in user table. They refer to number of votes sent by the user. On review table these features were counting number of votes given to a review.

Users can recieve compliments from other users. A complement is given on the profile page. It is not given for a single review, rather on the overall activity of a user.

Compliment Type Profile

Prepare Data for Modeling

In [22]:
# the code below is just practice code

# from scipy.sparse import csr_matrix, vstack, hstack, coo_matrix
# from IPython.display import HTML, display
# import tabulate

# list1 = np.array([[1,0,0], [4,5,0]])
# list1 = [[1,0,0], [4,5,0]]
# matrix1 = csr_matrix(list1)
# display(HTML(tabulate.tabulate(list1, tablefmt='html')))
# print(matrix1)
# print()
# # stacking a new row
# matrix2 = vstack((matrix1, [0,9,0]))
# list2 = matrix2.toarray()
# display(HTML(tabulate.tabulate(list2, tablefmt='html')))
# print(matrix2)
# print()
# # stacking a new column on matrix1
# matrix3 = hstack((matrix1,[[0],[4.5]] ))
# list3 = matrix3.toarray()
# display(HTML(tabulate.tabulate(list3, tablefmt='html')))
# print(matrix3)
# print()
# # stacking a two columns on matrix1
# matrix4 = hstack((matrix1,[[0,9.9589],[4.5214, 0]] ))
# list4 = matrix4.toarray()
# display(HTML(tabulate.tabulate(list4, tablefmt='html')))
# print(matrix4)
# print()
# # rounding numpy matrices
# matrix4.data= np.round(matrix4.data,2)
# print(matrix4)

Do closed businesses have more nulls?

In [23]:
# nulls in percent
((b.isnull().sum()/len(b)) * 100).round(5)
Out[23]:
address          0.00000
attributes      13.67283
business_id      0.00000
categories       0.28686
city             0.00000
hours           23.75592
is_open          0.00000
latitude         0.00318
longitude        0.00318
name             0.00000
neighborhood     0.00000
postal_code      0.00000
review_count     0.00000
stars            0.00000
state            0.00000
dtype: float64
In [24]:
nulls_count = pd.DataFrame({"open":(b[b.is_open == True].isnull().sum()/len(b[b.is_open == False])).round(5), 
                           "closed":(b[b.is_open == False].isnull().sum()/len(b[b.is_open == False])).round(5)})

nulls_count = nulls_count[(nulls_count.open > 0) | (nulls_count.closed > 0)]
nulls_count["feature"] = nulls_count.index
nulls_count_melted = pd.melt(nulls_count, id_vars=['feature'], value_vars=['open', 'closed'])
sns.pointplot(x="feature", y="value", hue="variable", kind="point", data=nulls_count_melted)
sns.despine()
plt.show()

No, closed business have much lesser nulls

Class imbalance: Business

In [25]:
sns.set(rc={'figure.figsize':(8,6)})
sns.countplot(b.is_open)
plt.title("Imbalanced Classes")
plt.xticks([0,1], ("Closed", "Open"))
plt.xlabel("Class")
plt.ylabel("Frequency")
sns.despine()
plt.show()
In [26]:
pd.value_counts(b.is_open)/b.shape[0]
Out[26]:
1    0.830391
0    0.169609
Name: is_open, dtype: float64
  • 23% of hours is missing. I can drop hours

  • 0.29 % rows have missing category. I can drop those rows

  • 0.001 % rows have missing latitude, longitude information. Dropping them will not affect the data

  • I am not going to use the attributes feature.

In [27]:
b.drop("hours", axis=1, inplace=True)
b = b.drop(b[b.categories.isnull()].index)
b = b.drop(b[(b.latitude.isnull()) | (b.longitude.isnull()) ].index)
b = b.drop("attributes", axis=1)
# nulls in percent
((b.isnull().sum()/len(b)) * 100).round(5)
Out[27]:
address         0.0
business_id     0.0
categories      0.0
city            0.0
is_open         0.0
latitude        0.0
longitude       0.0
name            0.0
neighborhood    0.0
postal_code     0.0
review_count    0.0
stars           0.0
state           0.0
dtype: float64

Binarize Categories

A business is labeled to more than one categories. Converting them to binary will make filtering and analyzing very easy. Let me try to one-hot encode them.

In [28]:
b.categories.head()
Out[28]:
0    Tours, Breweries, Pizza, Restaurants, Food, Ho...
1    Chicken Wings, Burgers, Caterers, Street Vendo...
2    Breakfast & Brunch, Restaurants, French, Sandw...
3                        Insurance, Financial Services
4    Home & Garden, Nurseries & Gardening, Shopping...
Name: categories, dtype: object

Splitting by comma works except for businesses that have a single category seprated by commas.

  • Books, Mags, Music & Video
  • Used, Vintage & Consignment
  • Beer, Wine & Spirits
  • Wills, Trusts, & Probates

Since there is no syntatical difference beween single category containing comma and multiple categories separated by comma, i couln't find a regex expression to deal with this problem. Replacing manually is the remaining option.

In [29]:
b.loc[:, 'categories'] = b.categories.str.replace("Books, Mags, Music & Video", "Books+Mags+ & MusicVideo")
b.loc[:, 'categories'] = b.categories.str.replace("Used, Vintage & Consignment", "Used+Vintage+ & Consignment")
b.loc[:, 'categories'] = b.categories.str.replace("Beer, Wine & Spirits", "Beer+Wine & Spirits")
b.loc[:, 'categories'] = b.categories.str.replace("Wills, Trusts, & Probates", "Wills+Trusts+&Probates")

One hot encoding is for single label categorical variables. For multiple labels Scikit-learn's MultiLabelBinarizer is the way to go.

In [30]:
# split by comma
categories_series = b.categories.apply(lambda x: x.split(","))
# a  lambda function to convert the following data structure
# [list(['Tours', ' Breweries', ' Pizza', ' Restaurants', ' Food', ' Hotels & Travel'])
# to 
# ['Tours', ' Breweries', ' Pizza', ' Restaurants', ' Food', ' Hotels & Travel']
take_out_list = lambda x: list(x)
# get 2D list
categories_2D_list = take_out_list(categories_series)
# remove leading and trailing white spaces
for alist in categories_2D_list:
    alist[:] = [category.strip() for category in alist]
#print(categories_2D_list[:3])
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
categories_binarized = mlb.fit_transform(categories_2D_list)
#print(categories_binarized.shape)
#print(mlb.classes_.shape)
# mlb.classes_ returns the unique categories found by the binarizer
categories_binarized_df = pd.DataFrame(categories_binarized, columns=list(mlb.classes_))
categories_binarized_df.sample(10)
Out[30]:
3D Printing ATV Rentals/Tours Acai Bowls Accessories Accountants Acne Treatment Active Life Acupuncture Addiction Medicine Adoption Services ... Wine Tasting Room Wine Tours Wineries Women's Clothing Workers Compensation Law Wraps Yelp Events Yoga Ziplining Zoos
119558 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
73526 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
159212 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
177933 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
162643 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
137603 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14041 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15438 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
175481 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
90684 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 1299 columns

In [31]:
mlb.classes_
Out[31]:
array(['3D Printing', 'ATV Rentals/Tours', 'Acai Bowls', ..., 'Yoga',
       'Ziplining', 'Zoos'], dtype=object)

Top 20 Business Categories

In [32]:
top_20_categories = categories_binarized_df.sum().sort_values(ascending=False)[:20]
sns.set(rc={'figure.figsize':(11,8)})
ax = sns.barplot(y=top_20_categories.index, x=top_20_categories.values, orient='h')
sns.despine()

Are these categories combined in groups of 3 or more?

In [33]:
# find the number of total busineses
total_business = categories_binarized_df.shape[0]
combinations_of_top20 = categories_binarized_df[top_20_categories.index].apply(lambda x: list(x.index[np.where(x != 0)]), axis=1)
# convert index to string to make it hashable
import json
# dumps converts list to string
combinations_of_top20 = combinations_of_top20.map(json.dumps)
value_counts_top20 = combinations_of_top20.value_counts()
value_counts_top20
count_g_2 = 0
for c in value_counts_top20.index:
    # count combinations of two or more categories
    # loads converts back to list
    if(len(json.loads(c)) > 2):
        count_g_2 += value_counts_top20.loc[c]
print(count_g_2)
print(np.round((count_g_2/total_business) * 100, 2), "%")
26718
14.21 %

Only 14% of all businesses have combined categories from the top-20 categories. This indicates that most of these are top level categories encompassing many businesses.

In [34]:
total_business
Out[34]:
188045

There are 1888045 businesses in the dataset.

Which of these top business categories have higher closure?

In [35]:
top_20_categories.index
Out[35]:
Index(['Restaurants', 'Shopping', 'Food', 'Beauty & Spas', 'Home Services',
       'Health & Medical', 'Local Services', 'Automotive', 'Nightlife', 'Bars',
       'Event Planning & Services', 'Active Life', 'Fashion', 'Coffee & Tea',
       'Sandwiches', 'Hair Salons', 'Fast Food', 'American (Traditional)',
       'Pizza', 'Home & Garden'],
      dtype='object')
In [36]:
# concatenate is_open info to the binaried categories data
open_close_df = pd.concat([categories_binarized_df[top_20_categories.index], b.is_open], axis=1)
# find all open businesses
open_df = open_close_df[open_close_df.is_open ==1]
# find all closed businesses
close_df = open_close_df[open_close_df.is_open ==0]
# sum across columns to find number of open business in each column(category)
open_close_count = (pd.concat([open_df.sum(), close_df.sum()], axis=1))
# change the default column names
open_close_count.columns = ["Open", "Closed"]
#print(open_close_count)
# divide the counts by the total businesses in each category
open_close_count_normalized = open_close_count.div(open_df.sum() + close_df.sum(), axis=0).sort_values(by="Closed", ascending=False)[:-2]
sns.set(rc={'figure.figsize':(8,8)})
ax = sns.pointplot(x=open_close_count_normalized.Closed, y=open_close_count_normalized.index, orient='h')
sns.despine()

Those businesses that are at the end of the top 20 list are on the top on this graph and vice versa. Fast-food and Coffee/Tea have the highest closure rate though they are limited in number compared to the other businesses. Shopping centers and Health and Medical businesses are in large quantities but they are least likely to be closed.

Number of Reviews: Restaurants vs Shopping

In [37]:
categories_binarized_df["business_id"] = b.business_id.values
categories_binarized_df["is_open"] = b.is_open.values

restaurants_df =  categories_binarized_df[categories_binarized_df.Restaurants == 1]
shopping_df = categories_binarized_df[categories_binarized_df.Shopping == 1]

total_reviews = 5996996

# # the code below takes time. The results are saved to file and loaded in the next cell

# csize = 1e6
# review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)


# num_restaurant_reviews = 0
# num_shopping_reviews = 0
# total_reviews = 0 
# for i, r in enumerate(review_reader):
#     total_reviews += r.shape[0]
#     print("working on chunk ",i," shape =", r.shape[0])
#     # count number of restaurant reviews found in chunk r
#     num_restaurant_reviews += pd.merge(restaurants_df, r, on=["business_id"]).shape[0]
#     # count number of shopping reviews found in chunk r
#     num_shopping_reviews += pd.merge(shopping_df, r, on=["business_id"]).shape[0]

    
# pickle.dump(num_restaurant_reviews, open("yelp-dataset/num_restaurant_reviews.sav", 'wb'))
# pickle.dump(num_shopping_reviews,open("yelp-dataset/num_shopping_reviews.sav", 'wb'))
In [38]:
num_restaurant_reviews = pickle.load(open("yelp-dataset/num_restaurant_reviews.sav", 'rb'))
num_shopping_reviews = pickle.load(open("yelp-dataset/num_shopping_reviews.sav", 'rb'))

print("Percent of Restaurant Businesses: {:.2f}".format(restaurants_df.shape[0] / total_business))
print("Percent of Shopping Businesses: {:.2f}".format(shopping_df.shape[0] / total_business))
print()
print("Percent of Restaurant Reviews: {:.2f}".format(num_restaurant_reviews / total_reviews))
print("Percent of Shopping Reviews: {:.2f}".format(num_shopping_reviews / total_reviews))
Percent of Restaurant Businesses: 0.30
Percent of Shopping Businesses: 0.16

Percent of Restaurant Reviews: 0.61
Percent of Shopping Reviews: 0.07

There are a lot of restaurants on Yelp. Second top are shopping businesses. When you see the number of reviews restaurants take more share than their number. Thus, restaurants are highly reviewed.

Reviews by Year

In [39]:
#The following code takes time. Saved data is loaded in the next cell

# csize=1e6
# review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)
# total_reviews = 0 
# restaurant_open_review_years = []
# restaurant_closed_review_years = []
# shopping_open_review_years = []
# shopping_closed_review_years = []

# for i, r in enumerate(review_reader):
#     total_reviews += r.shape[0]
#     print("working on chunk ",i," shape =", r.shape[0])
    
#     restaurant_reviewsR = pd.merge(restaurants_df, r, on=["business_id"])
    
#     restaurant_reviews_closed = restaurant_reviewsR[restaurant_reviewsR.is_open == 0]
#     restaurant_reviews_open = restaurant_reviewsR[restaurant_reviewsR.is_open == 1]
    
    
#     restaurant_closed_review_years.append(pd.to_datetime(restaurant_reviews_closed.date.dt.year).values)
#     restaurant_open_review_years.append(pd.to_datetime(restaurant_reviews_open.date.dt.year).values)
 
#     shopping_restaurantR= pd.merge(shopping_df, r, on=["business_id"])
    
#     shopping_reviews_closed = shopping_restaurantR[shopping_restaurantR.is_open == 0]
#     shopping_reviews_open = shopping_restaurantR[shopping_restaurantR.is_open == 1]
    
#     shopping_closed_review_years.append(pd.to_datetime(shopping_reviews_closed.date.dt.year).values)
#     shopping_open_review_years.append(pd.to_datetime(shopping_reviews_open.date.dt.year).values)
    
# # save the lists to file
# pickle.dump(restaurant_open_review_years, open("yelp-dataset/restaurant_open_review_years.sav", 'wb'))
# pickle.dump(restaurant_closed_review_years, open("yelp-dataset/restaurant_closed_review_years.sav", 'wb'))
# pickle.dump(shopping_open_review_years, open("yelp-dataset/shopping_open_review_years.sav",'wb'))
# pickle.dump(shopping_closed_review_years, open("yelp-dataset/shopping_closed_review_years.sav", 'wb'))
In [40]:
# load lists from file
restaurant_open_review_years = pickle.load(open("yelp-dataset/restaurant_open_review_years.sav", 'rb'))
restaurant_closed_review_years = pickle.load(open("yelp-dataset/restaurant_closed_review_years.sav", 'rb'))
shopping_open_review_years = pickle.load(open("yelp-dataset/shopping_open_review_years.sav", 'rb'))
shopping_closed_review_years = pickle.load(open("yelp-dataset/shopping_closed_review_years.sav", 'rb'))

restaurant_open_review_years = np.concatenate(restaurant_open_review_years).ravel()
restaurant_closed_review_years = np.concatenate(restaurant_closed_review_years).ravel()
shopping_open_review_years = np.concatenate(shopping_open_review_years).ravel()
shopping_closed_review_years = np.concatenate(shopping_closed_review_years).ravel()
In [41]:
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(16,6), sharex= True, sharey=False)

sns.set(rc={'figure.figsize':(4,4)})
sns.distplot(restaurant_open_review_years.astype(int), kde=False, label="open", ax=ax1)
sns.distplot(restaurant_closed_review_years.astype(int), kde=False, label = "closed", ax=ax1)
ax1.legend()
ax1.set_ylabel("Count")
ax2.set_xlabel("Year")
ax1.set_title("Restaurants Review")

sns.distplot(shopping_open_review_years.astype(int), kde=False, label="open", ax=ax2)
sns.distplot(shopping_closed_review_years.astype(int), kde=False, label = "closed", ax=ax2)
ax2.set_title("Shopping Review")
ax2.set_xlabel("Year")
ax1.set_ylabel("Count")
sns.despine()
ax2.legend()
plt.show()

The information on when a restaurant is closed is not provided by Yelp. We can only approximate the rate of closure based on the distrubiton of reviews. The graph shows that restaurants started recieving reviews on Yelp since 2004 and the reviews continued to increase. The year 2017 is the peak year for open restaurants. Many restaurants that were reviewed in 2014/15 are closed. We can not be sure whether they were closed in the same year or later.

Many shopping businesses that were reviewed in 2014-15 are closed by now.

Approxmating Restaurants Age

In [42]:
#The following code takes 15 minutes. Saved data is loaded in the next cell

# csize = 1e5
# review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)
# total_reviews = 0 
# #restaurant_open_last_review_year = []
# #hopping_open_last_review_year = 
# #hopping_closed_last_review_year = []
# for i, r in enumerate(review_reader):
#     with warnings.catch_warnings():
#         warnings.simplefilter("ignore")
#         print("working on chunk ",i," shape =", r.shape[0])

#         restaurant_reviewsR = pd.merge(restaurants_df, r, on=["business_id"])

#         restaurant_reviewsR["review_year"] = pd.to_datetime(restaurant_reviewsR.date.dt.year)

#         if(i == 0):
#             restaurants_open_year = restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].min()
#             restaurants_closed_year = restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].max()

#         else:
#             restaurants_open_year = pd.concat([restaurants_open_year, restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].min()])
#             restaurants_closed_year = pd.concat([restaurants_closed_year, restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].max()])

# # save the dataframes to file
# pickle.dump(restaurants_open_year, open("yelp-dataset/restaurants_open_year.sav", 'wb'))
# pickle.dump(restaurants_closed_year, open("yelp-dataset/restaurants_closed_year.sav", 'wb'))
In [43]:
# open and close in the file names refer to earliest review year and most recent review years
# it is a naming mistake made when dumping the objects
earliest_review_year_all = pickle.load(open("yelp-dataset/restaurants_open_year.sav", 'rb'))
most_recent_review_year_all = pickle.load(open("yelp-dataset/restaurants_closed_year.sav", 'rb'))

earliest_review_year = earliest_review_year_all.groupby("business_id").min()
most_recent_review_year = most_recent_review_year_all.groupby("business_id").max()

restaurants_dates = pd.concat([earliest_review_year, most_recent_review_year], axis=1)
restaurants_dates.columns = ["earliest_review_year", "most_recent_review_year"]
# filter required attributes
restaurants_only = pd.merge(b[["business_id", "name", "state", "latitude", "longitude", "stars","review_count", "is_open"]], 
                            restaurants_dates, on="business_id")
restaurants_only.head()
Out[43]:
business_id name state latitude longitude stars review_count is_open earliest_review_year most_recent_review_year
0 Apn5Q_b6Nz61Tq4XzPdf9A Minhas Micro Brewery AB 51.091813 -114.031675 4.0 24 1 1970-01-01 00:00:00.000002013 1970-01-01 00:00:00.000002018
1 AjEbIBw6ZFfln7ePHha9PA CK'S BBQ & Catering NV 35.960734 -114.939821 4.5 3 0 1970-01-01 00:00:00.000002017 1970-01-01 00:00:00.000002017
2 O8S5hYJ1SMc8fA4QBtVujA La Bastringue QC 45.540503 -73.599300 4.0 5 0 1970-01-01 00:00:00.000002014 1970-01-01 00:00:00.000002016
3 6OuOZAok8ikONMS_T3EzXg Thai One On ON 43.712946 -79.632763 2.0 7 1 1970-01-01 00:00:00.000002012 1970-01-01 00:00:00.000002014
4 8-NRKkPY1UiFXW20WXKiXg Filiberto's Mexican Food AZ 33.448106 -112.341302 2.5 40 1 1970-01-01 00:00:00.000002008 1970-01-01 00:00:00.000002018
In [44]:
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(12,8), sharex=True)
fig.subplots_adjust(hspace = 0.4, wspace = 0.4)

sns.set(rc={'figure.figsize':(4,4)})

open_earliest_years = restaurants_only[restaurants_only.is_open == True].earliest_review_year.values.astype(int)
open_most_recent_years = restaurants_only[restaurants_only.is_open == True].most_recent_review_year.values.astype(int)
closed_earliest_years = restaurants_only[restaurants_only.is_open == False].earliest_review_year.values.astype(int)
closed_most_recent_years = restaurants_only[restaurants_only.is_open == False].most_recent_review_year.values.astype(int)


sns.distplot(open_earliest_years, kde=False, label="newly opened", ax=ax1)
sns.distplot(open_most_recent_years, kde=False, label = "most recent", ax=ax1)

sns.distplot(closed_earliest_years, kde=False, label="newly opened", ax=ax2)
sns.distplot(closed_most_recent_years, kde=False, label = "out of business", ax=ax2)

ax1.set_title("Open Restaurants")
ax2.set_title("Closed Restaurants")

ax1.legend(loc="upper left")
ax2.legend(loc="upper left")
ax1.set_ylabel("Count")
ax2.set_ylabel("Count")

ax2.set_xlabel("Year")
plt.suptitle("Review Year")
sns.despine()

We are approximating the earliest year as the opening year of a restaurant. For currently closed restaurants, the most recent year is approximated as the out of business year.

Looking at the upper graph; almost all currently open restaurants were reviewed within the last 2 years. The distriubtion of earliest review year is left skewed. Since 2010 similar number of restaurants were opened

Looking at the bottom graph; number of newly opened restaurants is normally distributed for closed restaurants. Many restaurants that were opened in 2010 are closed by now. Restaurants that are recently opened are not closed by now. This indicates that new restaurants are not very likely to be closed and it takes some years. The difference betwen number new restaurants(red) and closed restaurants(blue) kept on increasing until 2013 and then it decreases.

They are many contributing factors that are outside the scope of this analysis.

In [45]:
len(set(restaurants_only.business_id))
Out[45]:
57173

There are 57173 restaurants on Yelp

Create Age Feature

In [46]:
# subtract earliest review year from 2018
restaurants_only["age"] = 2018 - restaurants_only.earliest_review_year.values.astype(int)
# select features
restaurants_only = restaurants_only[['business_id', 'name', "state", 'latitude', 'longitude', 'stars', 'review_count','age', 'is_open']]
restaurants_only.age.describe()
Out[46]:
count    57173.000000
mean         5.993266
std          3.145764
min          0.000000
25%          3.000000
50%          6.000000
75%          8.000000
max         14.000000
Name: age, dtype: float64

The maximum age is only 14 years because Yelp started in 2004

Class Imbalance: Restaurants

In [47]:
sns.set(rc={'figure.figsize':(8,6)})
sns.countplot(restaurants_only.is_open)
plt.title("Imbalanced Classes")
plt.xticks([0,1], ("Closed", "Open"))
plt.xlabel("Class")
plt.ylabel("Frequency")
sns.despine()
plt.show()
In [48]:
pd.value_counts(restaurants_only.is_open)/restaurants_only.shape[0]
Out[48]:
1    0.723104
0    0.276896
Name: is_open, dtype: float64
In [49]:
pd.value_counts(restaurants_only.is_open)
Out[49]:
1    41342
0    15831
Name: is_open, dtype: int64

Open and Closed Restaurants are not as imbalanced as they are for all busineses. Out of all businesses 83% of them are open but among all restaurants 72% of them are open.

In [50]:
restaurants_only.head()
Out[50]:
business_id name state latitude longitude stars review_count age is_open
0 Apn5Q_b6Nz61Tq4XzPdf9A Minhas Micro Brewery AB 51.091813 -114.031675 4.0 24 5 1
1 AjEbIBw6ZFfln7ePHha9PA CK'S BBQ & Catering NV 35.960734 -114.939821 4.5 3 1 0
2 O8S5hYJ1SMc8fA4QBtVujA La Bastringue QC 45.540503 -73.599300 4.0 5 4 0
3 6OuOZAok8ikONMS_T3EzXg Thai One On ON 43.712946 -79.632763 2.0 7 6 1
4 8-NRKkPY1UiFXW20WXKiXg Filiberto's Mexican Food AZ 33.448106 -112.341302 2.5 40 10 1

Outliers

In [51]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(8,6), sharey=True)
# making spaces between subplots
#fig.subplots_adjust(hspace = 0.2, wspace = 0.2)
ax1.set_xlabel("Num of Reviews")
ax2.set_xlabel("Num of Reviews")
ax1.set_ylabel("Count")
ax2.set_ylabel("Count")
sns.kdeplot(restaurants_only.review_count.sort_values(), ax=ax1)
sns.kdeplot(restaurants_only.review_count.sort_values(), ax=ax2)
ax2.set_xlim(0, 2000)
plt.show()
In [52]:
sns.boxplot(restaurants_only.review_count.sort_values())
plt.show()
In [53]:
from scipy import stats
z = np.abs(stats.zscore(restaurants_only.review_count))
threshold = 3
outlier_businesses = list(np.where(z > 3)[0])
print("Number of restaurants: ", restaurants_only.shape[0])
print("Number of outliers: ", len(outlier_businesses))
Number of restaurants:  57173
Number of outliers:  786

Removing 786 businesses will not harm because there is plenty of data.

In [54]:
business_ids = restaurants_only.iloc[outlier_businesses, 0].to_frame()
pd.merge(business_ids, b, on=["business_id"])["is_open"].value_counts()
Out[54]:
1    741
0     45
Name: is_open, dtype: int64
In [55]:
print(round(45/(741 + 45) * 100), "%")
6 %

6% of the outliers are closed. Compared to 27% closed , removing them will not harm the data.

In [56]:
pd.merge(business_ids, b, on=["business_id"])[["name", "state", "review_count", "address"]].sample(10)
Out[56]:
name state review_count address
449 Snooze an Am Eatery AZ 649 310 N Gilbert Rd
61 Soyo Korean Barstaurant NV 876 7775 S Rainbow Blvd, Ste 105
320 Margaritaville NV 1401 3555 Las Vegas Blvd S
65 Mr Mamas NV 2064 5693 S Jones Blvd, Ste 106
768 Satay Thai Bistro & Bar NV 571 3900 Paradise Rd, Ste N
432 Coconut's Fish Cafe AZ 776 16640 N Scottsdale Rd
702 Hot N Juicy Crawfish NV 726 3863 Spring Mountain Rd
654 The Stand AZ 870 3538 E Indian School Rd
491 Mon Ami Gabi NV 7968 3655 Las Vegas Blvd S
425 Bachi Burger NV 3250 470 E Windmill Ln, Ste 100

These must be very popular places because they are highly reviewed. They may create problems during text processing because a lot of words will be used in their reviews.

In [57]:
# check shape of dataframe before removal
restaurants_only.shape
Out[57]:
(57173, 9)
In [58]:
# remove outlier businesses
restaurants_only = restaurants_only.drop(outlier_businesses)
In [59]:
# check shape again
restaurants_only.shape
Out[59]:
(56387, 9)
In [60]:
# Look 786 indices are missing from the data
pd.RangeIndex(stop=57173).difference(restaurants_only.index).shape
Out[60]:
(786,)
In [61]:
restaurants_only = restaurants_only.reset_index(drop=True)
In [62]:
# now index is reset correctly
pd.RangeIndex(stop=57173 - 786).difference(restaurants_only.index).shape
Out[62]:
(0,)
In [63]:
restaurants_only.is_open.value_counts()
Out[63]:
1    40601
0    15786
Name: is_open, dtype: int64
In [64]:
# !python -m spacy download en
# import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')

Visualizing Restaurants

In [65]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)
In [66]:
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()

data = [
    go.Scattermapbox(lat=restaurants_only.latitude, lon=restaurants_only.longitude,
    mode='markers',
    text=restaurants_only.name)
]

layout = go.Layout(
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken="pk.eyJ1IjoicHJpeWF0aGFyc2FuIiwiYSI6ImNqbGRyMGQ5YTBhcmkzcXF6YWZldnVvZXoifQ.sN7gyyHTIq1BSfHQRBZdHA",
        bearing=0,
        center=dict(
            lat=38.92,
            lon=-77.07
        ),
        pitch=0,
        zoom=1
    ),
     width=900,
    height=600, 
    title = "World"
)

fig = dict(data=data, layout=layout)
iplot(fig, filename='Multiple Mapbox')

The restaurants are from all over the world. North America, South America, Europe, Asia and Antartica are included

Reverse Geocoding

Yelp dataset do not contain country data, but state is avaialble. US and Canda states are popular and well known. For the other countries, I check the longitude and latitude data to determine the country using google api

In [67]:
restaurants_only.state.unique()
Out[67]:
array(['AB', 'NV', 'QC', 'ON', 'AZ', 'OH', 'IL', 'WI', 'PA', 'NC', 'BY',
       'SC', 'C', 'XGM', 'IN', 'RP', 'NYK', 'ST', 'NI', 'CO', 'HE', 'VA',
       'RCC', '01', 'NY', 'NW', '4', 'OR', '10', 'CA', 'NLK', '45', 'G',
       'PO', 'B', 'VS', 'WAR', 'MO', 'HU', 'M', 'O', '6', 'FL', 'XMS',
       'AG', 'SG', 'WHT', 'V', 'BC', 'SP', '11'], dtype=object)
In [68]:
len(restaurants_only)
Out[68]:
56387
In [69]:
# us_states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
#           "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
#           "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
#           "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
#           "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

# canada_states = ['ON', 'QC', 'NS', 'NB', 'MB', 'BC', 'PE', 'SK', 'AB', 'NL']
# from geopy.geocoders import GoogleV3
# # registered the following google maps geoencoding api
# apikeys = ["AIzaSyDta3bVjrSokrV2xc2MaQvN-7JFv4BkAYg", "AIzaSyCxImBiEldxwzgW10PhfXVW1A_OSZqNecI"]
# # use it by alternating randomly
# geoencoder = GoogleV3(api_key=np.random.choice(apikeys))
# # a function to find country 
# def find_country(row):
#     # http://en.wikipedia.org/wiki/Extreme_points_of_the_United_States#Westernmost
#     top = 49.3457868 # north lat
#     left = -124.7844079 # west long
#     right = -66.9513812 # east long
#     bottom =  24.7433195 # south lat
#     # some states have same abberviation like USA. The check below will help.
#     inside_usa = bottom <= row['latitude'] <= top and left <= row['longitude'] <= right
#     if(inside_usa and row.state.strip() in us_states):
#         return "USA"
#     elif(row.state.strip() in canada_states):
#         return "Canada"
#     else:
#         latlon = str(row['latitude']) + ", " + str(row['longitude'])
#         location = geoencoder.reverse(latlon, exactly_one=True)
#         if location is not None:
#             return location.address.split(",")[-1].strip()
#         else:
#             return "Not-Found"
# # create new feature: country   
# countries_googleapi = pd.DataFrame(restaurants_only.apply(lambda row: find_country(row), axis=1))
# countries_googleapi.to_csv("countries_googleapi.csv", index=None)
In [70]:
# create country column
restaurants_only['country'] = pd.read_csv("countries_googleapi.csv")
In [71]:
# for testing
# from geopy.geocoders import GoogleV3
# #location = geolocator.reverse("43.748826, -79.347592", exactly_one=True)
# location.address.split(",")[-1].strip()
In [72]:
# value counts
restaurants_only.country.value_counts()
Out[72]:
USA                 33408
Canada              22617
Germany               188
UK                     88
Argentina              25
Italy                  23
Norway                 11
Ireland                 5
Chile                   3
Spain                   3
Sweden                  3
Austria                 2
United Kingdom          2
China                   2
France                  1
Portugal                1
Not-Found               1
Bermuda                 1
Antarctica              1
Belgium                 1
Singapore 098270        1
Name: country, dtype: int64

The documentation of Yelp dataset claims that the data is from 11 meteropolitan areas in the US and Canada. But the data tells a different story. Some restaurants from other countries are included

In [73]:
restaurants_only[~restaurants_only.country.isin(["USA", "Canada"])].sample(5)
Out[73]:
business_id name state latitude longitude stars review_count age is_open country
37283 4ve4lRUFRLwbrIeIzw3NZA Gaststätte Zum Korken NI 53.603100 9.476220 3.5 10 7 0 Germany
19226 NNTm8w9x-Fdw9D16fw8FPg Restaurant zur Alten Brauerei ST 51.790514 12.952687 3.5 3 9 1 Germany
49576 L8psDqzHfAb3TblNY2LpDw Auberge du Dominicain 11 43.240779 2.103968 4.5 3 11 1 France
53224 IUGbmiCpZ3_mCMM7yswBWQ Suisse Restaurant CO 45.986745 9.260872 4.0 5 3 1 Italy
15409 IkSm-C64EQ8CvsVjNnpENQ Monks Haven Cafe NYK 54.487751 -0.612052 3.5 3 6 1 UK
In [74]:
# remove other countries
indices = restaurants_only[~restaurants_only.country.isin(["USA", "Canada"])].index
restaurants_only = restaurants_only.drop(indices).reset_index(drop=True)
In [75]:
# check value counts again
restaurants_only.country.value_counts()
Out[75]:
USA       33408
Canada    22617
Name: country, dtype: int64
In [76]:
len(restaurants_only)
Out[76]:
56025
In [77]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()

data = [
    go.Scattermapbox(lat=restaurants_only.latitude, lon=restaurants_only.longitude,
    mode='markers',
    text=restaurants_only.name +", " + restaurants_only.country)
]

layout = go.Layout(
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken="pk.eyJ1IjoicHJpeWF0aGFyc2FuIiwiYSI6ImNqbGRyMGQ5YTBhcmkzcXF6YWZldnVvZXoifQ.sN7gyyHTIq1BSfHQRBZdHA",
        bearing=0,
        center=dict(
            lat=38.92,
            lon=-77.07
        ),
        pitch=0,
        zoom=1
    ),
     width=900,
    height=600, 
    title = "World"
)

fig = dict(data=data, layout=layout)
iplot(fig, filename='Multiple Mapbox')

Clustering by distance will successfully remove the remaining erroronus coordinates

In [78]:
# check any missing indices
restaurants_only.index
Out[78]:
RangeIndex(start=0, stop=56025, step=1)

Clustering Data

DBScan has two important hyper parameters, epislon and min_samples. epislon refers to maximum distance between two data points to be considered as neighbors. As eps increases the number of clusters decrease because some clusters will be merged. min_samples is the miniumum number of data points in a cluster. If it is small, It will have a lot of clusters.

They are three kind of points in core, boundary and noises. Core points are surrounding by min_samples atleast. Boundary points do not have enough neigbors to be a core point. They are directly reachable by atleast one core point. Other core points can reach indirectly through other core points.

Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point, regardless of distance (so a non-core point may be reachable, but nothing can be reached from it). Therefore, a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there is a point o such that both p and q are reachable from o. Density-connectedness is symmetric.

A cluster then satisfies two properties:

  1. All points within the cluster are mutually density-connected.
  2. If a point is density-reachable from any point of the cluster, it is part of the cluster as well.

HDBSCAN has two important hyper parameters: min_samples and min_cluster_size. It doesn't need maximum distance instead it takes min_cluster_size. The cluster_size is analoguous to the area taken by the cluster.

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In [79]:
coords = restaurants_only[['latitude', 'longitude']]
In [80]:
from sklearn.cluster import DBSCAN
# doing dbscan
kms_per_radian = 6371.0088
miles_per_radian = 3959
# convert one mile to radians
epsilon = 1 / miles_per_radian
# haversine distance requires coordinates to be in radians
db_clusterer = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
db_cluster_labels = db_clusterer.labels_
db_num_clusters = len(set(db_cluster_labels))
db_clusters = pd.Series([coords[db_cluster_labels == n] for n in range(db_num_clusters)])
print('Number of clusters: {}'.format(db_num_clusters))
Number of clusters: 297

HDBSCAN don't take maximum distance. It requires min_cluster_size and min_samples per cluster. I can use it with haversine distance. The biggest advantage of HDBSCAN is it works when data has varying densities.

In [81]:
import hdbscan
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # 30: 80/20 split 6, 24
    hdb_clusterer = hdbscan.HDBSCAN(algorithm='best', metric='haversine', min_cluster_size=30, min_samples=5)
    hdb_clusterer.fit(np.radians(coords))
    hdb_cluster_labels = hdb_clusterer.labels_
    hdb_num_clusters = len(set(hdb_cluster_labels))
    hdb_clusters = pd.Series([coords[hdb_cluster_labels == n] for n in range(hdb_num_clusters)])
    print('Number of clusters: {}'.format(hdb_num_clusters))
Number of clusters: 645

DBSCAN/ HDBSCAN

In [82]:
print(len(db_clusterer.core_sample_indices_))
print(len(restaurants_only) - len(db_clusterer.core_sample_indices_))
54798
1227

Only 1227 restaurants have less than 5 neighbors within one mile

In [83]:
# core indices
restaurants_only.loc[db_clusterer.core_sample_indices_].sample(5)
Out[83]:
business_id name state latitude longitude stars review_count age is_open country
4467 OoKNxTMu5YAaNgQKQ4SrzA Desert Roots Kitchen AZ 33.425974 -111.940346 4.5 239 6 1 USA
23525 vqgfQqPiLw6ghdVgpZ_vJQ Jack Astor's Bar & Grill ON 43.656586 -79.380074 3.0 196 10 1 Canada
8630 3BGzRYh6GQbZz3LGAGb-sQ Gallagher's AZ 33.378930 -112.010982 3.0 212 10 1 USA
47233 whBcqr34XSx2NmWAYRp1Tg Spirit's Bar & Grill AZ 33.415098 -111.630139 1.5 7 7 0 USA
39309 w8ZxnScDWMquPfh6oHt0og Subway OH 41.369862 -81.803116 1.0 3 2 1 USA
In [84]:
# all are given labels
assert(len(restaurants_only) == len(db_cluster_labels))
In [85]:
# add cluster labels to the dataframe
restaurants_only['cluster_labels_db'] = db_clusterer.labels_
restaurants_only['cluster_labels_hdb'] = hdb_clusterer.labels_

# remove -1 
restaurants_only_db = restaurants_only[restaurants_only.cluster_labels_db != -1].reset_index(drop=True)

restaurants_only_hdb = restaurants_only[restaurants_only.cluster_labels_hdb != -1].reset_index(drop=True)
In [86]:
restaurants_only_db.index
Out[86]:
RangeIndex(start=0, stop=55076, step=1)
In [87]:
restaurants_only_hdb.index
Out[87]:
RangeIndex(start=0, stop=43205, step=1)

HDBSCAN detected more outliers

In [88]:
# draw the map

def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()


layout = go.Layout(
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken="pk.eyJ1IjoicHJpeWF0aGFyc2FuIiwiYSI6ImNqbGRyMGQ5YTBhcmkzcXF6YWZldnVvZXoifQ.sN7gyyHTIq1BSfHQRBZdHA",
        bearing=0,
        center=dict(
            lat=38.92,
            lon=-77.07
        ),
        pitch=0,
        zoom=1
    ),
     width=900,
    height=600, 
    title = "World"
)

data1 = [
    go.Scattermapbox(lat=restaurants_only_db.latitude, lon=restaurants_only_db.longitude, 
    mode='markers',
    text=restaurants_only_db.name +", " + restaurants_only_db.country + ", " + restaurants_only_db.state)
]

data2 = [
    go.Scattermapbox(lat=restaurants_only_hdb.latitude, lon=restaurants_only_hdb.longitude, 
    mode='markers',
    text=restaurants_only_db.name +", " + restaurants_only_db.country + ", " + restaurants_only_db.state)
]

layout.title = "DBSCAN"
fig1 = dict(data=data1, layout=layout)
iplot(fig1, filename='Multiple Mapbox1')

layout.title = "HDBSCAN"
fig2 = dict(data=data2, layout=layout)
iplot(fig2, filename='Multiple Mapbox2')

Analyzing DBSCAN Clustering

In [89]:
# testing delete
# rr = np.array([[1,2,3], [4,5,7]])
# np.delete(rr, (0), axis=0)
In [90]:
# showing num_of_samples in cluster
from collections import Counter
# sort labels by num of samples in ascending order
sorted_labels = np.array(sorted(Counter(db_cluster_labels).items(), key=lambda kv: kv[1]))
neg1loc = np.where(sorted_labels[:,0] == -1)[0][0]
# remove -1 count
sorted_labels = np.delete(sorted_labels, (neg1loc), axis=0)
sns.boxplot(sorted_labels[:,1])
plt.title("Showing outliers: All 296 clusters")
plt.show()
In [91]:
# 241 clusters are non-outliers out of 296 clusters
sns.boxplot(sorted_labels[:241,1])
plt.title("Smallest 241 Clusters")
plt.show()
In [92]:
sns.distplot(sorted_labels[:241,1])
plt.title("Distribution-Bottom 241")
plt.show()
In [93]:
# least num of samples
sorted_labels[:10, 1]
Out[93]:
array([3, 5, 5, 5, 5, 5, 5, 5, 5, 5])
In [94]:
# highest num of samples
sorted_labels[-10:, 1][::-1]
Out[94]:
array([13209,  9322,  6729,  4529,  2844,  2833,  2629,  1919,  1284,
         534])
In [95]:
sns.barplot(np.arange(len(restaurants_only_db.state.unique())), restaurants_only_db.state.value_counts().values)
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f20db2d4080>
In [96]:
restaurants_only_db.state.unique()
Out[96]:
array(['AB', 'QC', 'ON', 'AZ', 'NV', 'OH', 'PA', 'NC', 'WI', 'SC', 'IL',
       'IN', 'NY', 'OR', 'BC'], dtype=object)

Analyzing HDBSCAN

In [97]:
len(set(hdb_clusterer.labels_))
Out[97]:
645
In [98]:
# showing num_of_samples in cluster
from collections import Counter
# sort labels by num of samples in ascending order
sorted_labels = np.array(sorted(Counter(hdb_cluster_labels).items(), key=lambda kv: kv[1]))
neg1loc = np.where(sorted_labels[:,0] == -1)[0][0]
# remove -1 count
sorted_labels = np.delete(sorted_labels, (neg1loc), axis=0)
sns.boxplot(sorted_labels[:,1])
plt.title("Showing outliers: All 654 clusters")
plt.show()

There are no extreme outliers. Maximum num of samples is only 400

In [99]:
# 606 out of 644 clusters are non-outliers
sns.boxplot(sorted_labels[:606,1])
plt.title("Smallest 606 Clusters")
plt.show()
In [100]:
sns.distplot(sorted_labels[:609,1])
plt.title("Distribution-Bottom 613")
plt.show()

HDBSCAN has num of samples above 80 unlike DBSCAN.

In [101]:
# least num of samples
sorted_labels[:10, 1]
Out[101]:
array([30, 30, 30, 30, 30, 30, 30, 30, 30, 30])
In [102]:
# highest num of samples
sorted_labels[-10:, -1][::-1]
Out[102]:
array([394, 359, 342, 331, 330, 323, 314, 224, 203, 192])

Plotting the Clusters

In [103]:
# get center most point from a set of coordinates
from shapely.geometry import MultiPoint
def get_centermost_point(coordinates):
    centroid = (MultiPoint(coordinates).centroid.x, MultiPoint(coordinates).centroid.y)
    centermost_point = min(coordinates, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

DBSCAN

In [396]:
# A function to show clusters by state DBSCAN
def plot_cluser_by_state(state_code):
    state = restaurants_only_db[restaurants_only_db.state == state_code]
    # get centermost point
    centermost_point = get_centermost_point(state[['latitude', 'longitude']].values)

    # plotting the clusters in a map
    import random
    plotly_data = []
    for label in set(state.cluster_labels_db):
        r = int(random.random() * 256)
        g = int(random.random() * 256)
        b = int(random.random() * 256)
        the_cluster = state[state.cluster_labels_db == label ]
        mapbox = go.Scattermapbox(
                lat= the_cluster.latitude ,
                lon= the_cluster.longitude,
                mode='markers',
                fillcolor = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',

                marker=dict(
                    size= 8
                    #color = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',

                    ),
               name ='C ' + str(label) + "("+ str(len(the_cluster)) +")",
               text=the_cluster.name
              )
        plotly_data.append(mapbox)

    data = plotly_data
    layout = go.Layout(autosize=False, font=dict(color='black'), 
                       mapbox= dict(accesstoken="pk.eyJ1IjoidGdhamF2YSIsImEiOiJjanNqeDJuZHUxZGsxNDlxZ25mYWlucXlkIn0.0kR1orNq4Gv32CTmQKKTJw",
                                    style='dark',
                                    bearing=3,
                                    pitch=5,
                                    zoom=10,
                                    center= dict(
                                            lat= centermost_point[0],
                                            lon= centermost_point[1])
                                ),

                        width=900,
                        height=600, title = "One Mile Clusters - " + state_code
    )
    fig = dict(data=data, layout=layout)
    iplot(fig)
all_states = restaurants_only_db.state.unique()
for state in all_states[3:4]:
    plot_cluser_by_state(state)
In [105]:
# look at cluster 2 assignments. 
restaurants_only_db[restaurants_only_db.cluster_labels_db == 2].state.value_counts()
Out[105]:
ON    13207
AB        1
BC        1
Name: state, dtype: int64
In [106]:
restaurants_only_db[(restaurants_only_db.state == "AB") & (restaurants_only_db.cluster_labels_db == 2)]
Out[106]:
business_id name state latitude longitude stars review_count age is_open country cluster_labels_db cluster_labels_hdb
37260 Z6ho09TTe3Y9eXwX2n5vJA Domino's Pizza AB 43.748826 -79.347592 1.5 3 2 1 Canada 2 317
In [107]:
restaurants_only_db[(restaurants_only_db.state == "BC") & (restaurants_only_db.cluster_labels_db == 2)]
Out[107]:
business_id name state latitude longitude stars review_count age is_open country cluster_labels_db cluster_labels_hdb
47389 72GGspuzlRC4hhIgy-IFAw Subway BC 43.875795 -79.413535 1.5 3 3 1 Canada 2 295
In [108]:
b[b.business_id=="Z6ho09TTe3Y9eXwX2n5vJA"].address
Out[108]:
125480    1396 Don Mills Road, Suite 52
Name: address, dtype: object
In [109]:
b[b.business_id=="72GGspuzlRC4hhIgy-IFAw"].address
Out[109]:
161540    815 Major MacKenzie Drive E, Unit 14
Name: address, dtype: object

This addresses are in Ontario. It is a mistake in the data

first address

second address

In [110]:
# Domino's Pizza is core_point
37260 in db_clusterer.core_sample_indices_
Out[110]:
True
In [111]:
# A vectorized implemenation of geopy.distance.great_circle function
# Adopted by Tinsae 
# reference
# https://gist.github.com/rochacbruno/2883505
    
import math
# calculate the disance between a business and all other businesses
def distance_c(many_points, one_point):
    # convert the values to float
    lat1 = many_points[:,0]
    lon1 = many_points[:,1]
    lat2 = one_point[0]
    lon2 = one_point[1]
    radius = 6371 # km
    #radius = 3959 # miles
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    a = np.sin(dlat/2) * np.sin(dlat/2) + np.cos(np.radians(lat1)) \
        * np.cos(np.radians(lat2)) * np.sin(dlon/2) * np.sin(dlon/2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = radius * c

    return d
In [112]:
# from geopy.distance import great_circle
# sushi8 = (43.748826, -79.347592)
# for i, j in restaurants_only.loc[db.core_sample_indices_][['latitude', 'longitude']].iterrows():
#     print(great_circle((j.latitude, j.longitude), sushi8))
In [113]:
# find the nearest point to dominos in cluster 2
core_samples = restaurants_only_db.loc[db_clusterer.core_sample_indices_][['latitude', 'longitude']].values
dominos = [43.748826, -79.347592]
sorted_by_closeness = np.argsort(distance_c(core_samples, dominos))
# find that restaurant
print(core_samples[sorted_by_closeness[0]])
# the closest except itself: use 1 instead of 0
the_closest_core = sorted_by_closeness[1]
restaurants_only_db.loc[db_clusterer.core_sample_indices_[the_closest_core]]
[ 43.7488263 -79.3475916]
Out[113]:
business_id           dhpzraAYdlZEomsaZjt5yg
name                            Island Foods
state                                     ON
latitude                             43.7459
longitude                           -79.3463
stars                                    3.5
review_count                              53
age                                       10
is_open                                    1
country                               Canada
cluster_labels_db                          2
cluster_labels_hdb                       317
Name: 1451, dtype: object
In [114]:
# here are the top 10 closest to dominos in cluster 2
from geopy.distance import great_circle
print("{:35s}{:10s}{:10s}".format("Restaurant", "State","Distance(Miles)"))
print()
for i in range(0,11):
    current = sorted_by_closeness[i]
    the_indices = db_clusterer.core_sample_indices_[current]
    resta = restaurants_only_db.loc[the_indices]
    latlon = (resta.latitude, resta.longitude)
    
    print("{:35s}{:10s}{:5.5f}".format(resta['name'], resta.state, great_circle(dominos, latlon).miles))
Restaurant                         State     Distance(Miles)

Domino's Pizza                     AB        0.00003
Island Foods                       ON        0.21089
Gonoe Sushi                        ON        0.21089
Matsuda Japanese Cuisine           ON        0.23986
Tako Sushi Japanese Restaurant     ON        0.24256
Baretto Caffe                      ON        0.28535
Galleria Supermarket               ON        0.35035
Spoon And Fork Plus                ON        0.38130
Pizzaiolo                          ON        0.38130
Five Guys                          ON        0.38130
WonderPho                          ON        0.38130
In [115]:
subway = [43.875795, -79.413535]
sorted_by_closeness = np.argsort(distance_c(core_samples, subway))
In [116]:
# here are the top 10 closest to subway in cluster2
from geopy.distance import great_circle
print("{:35s}{:10s}{:10s}".format("Restaurant", "State","Distance(Miles)"))
print()
for i in range(0,11):
    current = sorted_by_closeness[i]
    the_indices = db_clusterer.core_sample_indices_[current]
    resta = restaurants_only_db.loc[the_indices]
    latlon = (resta.latitude, resta.longitude)
    
    print("{:35s}{:10s}{:5.5f}".format(resta['name'], resta.state, great_circle(subway, latlon).miles))
Restaurant                         State     Distance(Miles)

Subway                             BC        0.00002
Bayview Kitchen                    ON        0.00427
Restoran Malaysia                  ON        0.00808
Fred Kabob                         ON        0.00965
The Owl of Minerva                 ON        0.01276
McDonald's                         ON        0.05886
Wendy's Richmond Hill              ON        0.06092
Touhenboku Ramen                   ON        0.08643
Pita Pit                           ON        0.09737
Thai Basil                         ON        0.10649
Bombay's Chutney                   ON        0.10844

Both Subway and Dominos are in Ontario as I discovered in a previous cell.

In [397]:
# A function to show clusters by state HDBSCAN
def plot_cluser_by_state(state_code):
    state = restaurants_only_hdb[restaurants_only_hdb.state == state_code]
    # get centermost point
    centermost_point = get_centermost_point(state[['latitude', 'longitude']].values)

    # plotting the clusters in a map
    import random
    plotly_data = []
    for label in set(state.cluster_labels_hdb):
        r = int(random.random() * 256)
        g = int(random.random() * 256)
        b = int(random.random() * 256)
        the_cluster = state[state.cluster_labels_hdb == label ]
        mapbox = go.Scattermapbox(
                lat= the_cluster.latitude ,
                lon= the_cluster.longitude,
                mode='markers',
                fillcolor = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',

                marker=dict(
                    size= 8
                    #color = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',

                    ),
               name ='C ' + str(label) + "("+ str(len(the_cluster)) +")",
               text=the_cluster.name
              )
        plotly_data.append(mapbox)

    data = plotly_data
    layout = go.Layout(autosize=False, font=dict(color='black'), 
                       mapbox= dict(accesstoken="pk.eyJ1IjoidGdhamF2YSIsImEiOiJjanNqeDJuZHUxZGsxNDlxZ25mYWlucXlkIn0.0kR1orNq4Gv32CTmQKKTJw",
                                    style='dark',
                                    bearing=3,
                                    pitch=5,
                                    zoom=10,
                                    center= dict(
                                            lat= centermost_point[0],
                                            lon= centermost_point[1])
                                ),

                        width=900,
                        height=600, title = "One Mile Clusters - " + state_code
    )
    fig = dict(data=data, layout=layout)
    iplot(fig)
all_states = restaurants_only_hdb.state.unique()
for state in all_states[3:6]:
    plot_cluser_by_state(state)
In [118]:
sns.set(rc={'figure.figsize':(8,6)})
sns.countplot(restaurants_only_hdb.is_open)
plt.title("Imbalanced Classes")
plt.xticks([0,1], ("Closed", "Open"))
plt.xlabel("Class")
plt.ylabel("Frequency")
sns.despine()
plt.show()
In [119]:
restaurants_only_hdb.is_open.value_counts()
Out[119]:
1    30531
0    12674
Name: is_open, dtype: int64

Stratified Undersampling With Clusters

This project uses geospatial features for surrounding performance. In simple random undersampling all businesses have equal chance of being selected. This creates a sample where many restaurants have no near neighbour. Simple random undersampling don't consider similarities among datapoints.

I try to maximize the availablilty of surrounding restaurants within one mile of all sample restaurants. Here are the steps to follow

  • Step 1: Obtain cluster labels using all dataset
  • Step 2: Remove businesses that have no cluster
  • Step 3: Use latitude and longitude as X, cluster labels as y
  • Step 4: Determine probability of clusters: num_of_samples / len(all_restaurants)
  • Step 5: Create a dictionary where keys are cluster labels and values are round(probabilities * 15595)
  • Step 6:
          rus = RandomUnderSampler(sampling_strategy=the_dict)
          newX, newY = rus.fit_sample(X, y)

The cluster label is part of the dataframe. I do train test split stratified by the cluster label.

The first 3 steps are done in the previous cells

In [120]:
restaurants_only_hdb.head()
Out[120]:
business_id name state latitude longitude stars review_count age is_open country cluster_labels_db cluster_labels_hdb
0 O8S5hYJ1SMc8fA4QBtVujA La Bastringue QC 45.540503 -73.599300 4.0 5 4 0 Canada 1 567
1 6OuOZAok8ikONMS_T3EzXg Thai One On ON 43.712946 -79.632763 2.0 7 6 1 Canada 2 229
2 8-NRKkPY1UiFXW20WXKiXg Filiberto's Mexican Food AZ 33.448106 -112.341302 2.5 40 10 1 USA 3 102
3 KapTdGyGs7RK0c68Z6hhhg Sushi 8 ON 43.862484 -79.306960 1.5 12 2 0 Canada 2 364
4 tZnSodhPwNr4bzrwJ1CSbw Southern Accent Restaurant ON 43.664125 -79.411886 4.0 146 10 0 Canada 2 562

Undersample Majority Class

In [121]:
restaurants_only_hdb_open = restaurants_only_hdb[restaurants_only_hdb.is_open == 1] 
In [122]:
from imblearn.under_sampling import RandomUnderSampler
cluster_labels_unique = sorted(restaurants_only_hdb_open.cluster_labels_hdb.unique())
cluster_counts = restaurants_only_hdb_open.cluster_labels_hdb.value_counts().sort_index().values
num_of_open_restaurants = restaurants_only_hdb_open.shape[0]
# undersample majority class to be equal with the mninority class
undersampled_to = np.min(restaurants_only_hdb.is_open.value_counts().values)
desired_number = {label:int((count/num_of_open_restaurants) * undersampled_to) for label, count in zip(cluster_labels_unique, cluster_counts)}
sns.distplot(list(desired_number.values()))
plt.title("Strata Size Distribution")
plt.ylabel("Probability")
plt.xlabel("Strata Size")
sns.despine()
In [123]:
# rus = RandomUnderSampler(sampling_strategy=desired_number)
# X = restaurants_only_hdb_open.copy()
# y = restaurants_only_hdb_open.cluster_labels_hdb
# new_X, new_y = rus.fit_sample(X, y)
In [124]:
# open_busin_under = pd.DataFrame(new_X, columns=restaurants_only_hdb_open.columns)
# open_busin_under.head()
In [125]:
# restaurants_only_hdb_closed = restaurants_only_hdb[restaurants_only_hdb.is_open == 0]
# restaurants_undersampled = pd.concat([open_busin_under, restaurants_only_hdb_closed], ignore_index=True).reset_index(drop=True)
In [126]:
# assert(restaurants_undersampled.shape[0] == restaurants_only_hdb_closed.shape[0] + open_busin_under.shape[0])
In [127]:
# # save data for future runs
# restaurants_undersampled.to_csv("yelp-dataset/restaurants_undersampled.csv")
In [128]:
# load saved data
restaurants_undersampled = pd.read_csv("yelp-dataset/restaurants_undersampled.csv", index_col=0)
restaurants_undersampled.shape
Out[128]:
(25036, 12)
In [129]:
restaurants_undersampled.head()
Out[129]:
business_id name state latitude longitude stars review_count age is_open country cluster_labels_db cluster_labels_hdb
0 zowCVyE7nHtU_iqlIKORBQ Athena's Grill AB 51.293028 -114.002084 3.0 5 7 1 Canada 180 0
1 C2p9SIPijF0owPVngiCqFw Vietnamese Western Cuisine AB 51.271791 -114.017611 3.5 7 4 1 Canada 180 0
2 v_wCPqghNL_Q9G1deTTcWA Smitty's Restaurant AB 51.286745 -113.998161 3.0 6 6 1 Canada 180 0
3 b7kpFwn_mBu0n76VFi_-hA Toad 'n' Turtle Pubhouse AB 51.269607 -113.995168 3.0 19 6 1 Canada 180 0
4 y-Jp4iSIMDJE4JzGBh7M-g Noodlebox AB 51.263739 -114.006040 4.5 10 5 1 Canada 180 0
In [130]:
restaurants_undersampled.state.value_counts()
Out[130]:
ON    6493
AZ    5035
NV    3448
OH    2268
QC    1989
NC    1880
PA    1730
AB    1129
WI     674
IL     256
SC     101
IN      33
Name: state, dtype: int64
In [131]:
restaurants_undersampled.review_count.describe()
Out[131]:
count    25036.000000
mean        47.188449
std         75.253652
min          3.000000
25%          7.000000
50%         18.000000
75%         51.000000
max        551.000000
Name: review_count, dtype: float64

Working on Yelp Reviews Data

In [132]:
import spacy
nlp = spacy.load("en", disable=['parser', 'ner'])
doc = nlp("She was dancing and walking in the garden. Flowers and plants are refreshing. I have a flower factory")
for token in doc:
    print(token.lemma_)
-PRON-
be
dance
and
walk
in
the
garden
.
flower
and
plant
be
refresh
.
-PRON-
have
a
flower
factory
In [133]:
# #using spacy lemmatizer as a class
# from spacy.lemmatizer import Lemmatizer
# from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
# lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
# print(nltk.pos_tag(["traumatizing"]))
# print(lemmatizer("hers", "P"))

# nltk classes used for tokenization and lemmatization
# from nltk import word_tokenize
# from nltk.stem import WordNetLemmatizer 

Custom Tokenizer-Lemmatizer

Scikit-learn's vectorizers don't convert words into their base form by default. The custom lemmatizer/tokenizer is passed as tokenizer for CountVectorizer and TfIdfVectorizer Objects.

I tried to use NLTK's word_tokenize and WordNetLemmatizer. NLTK methods work on a specified part of speech, the default is noun. It is not flexible to work for multiple part of speech.

Spacy is very flexible. When a spacy object is created, it will automatically know the part of speech of all tokens in the given document. It converts words into their base form depending on their POS tag. The only problem with spacy is its speed. Creating a spacy document involves tagging, parsing and named entity recognition. I disabled the steps that I don't need which increased its speed.

In [134]:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
/home/tigial3535/anaconda3/lib/python3.7/runpy.py:193: UserWarning:

[W005] Doc object not parsed. This means displaCy won't be able to generate a dependency visualization for it. Make sure the Doc was processed with a model that supports dependency parsing, and not just a language class like `English()`. For more info, see the docs:
https://spacy.io/usage/models

She PRON was VERB dancing VERB and CCONJ walking VERB in ADP the DET garden. NOUN Flowers NOUN and CCONJ plants NOUN are VERB refreshing. VERB I PRON have VERB a DET flower NOUN factory NOUN
In [135]:
import spacy
nlp = spacy.load("en", disable=['parser', 'ner'])
# Use custom tokenizer and lemmatizer
def lemmatizer_tokenize(a_review):
    # remove extra space and new line
    a_review = ' '.join(a_review.split())
    a_review = nlp(a_review)
    words = [token.lemma_ for token in a_review
                if (
                    not token.is_digit and
                    not token.is_punct and 
                    not token.is_stop and 
                    token.lemma_ != "-PRON-"
                )
            ]
    return words
In [136]:
# testing lemmatizer
lemmatizer_tokenize("Dancing entertains people. The girl was dancing when the shooters entered the club")
Out[136]:
['dancing',
 'entertain',
 'people',
 'the',
 'girl',
 'dance',
 'shooter',
 'enter',
 'club']

The first "Dancing" is a noun. The second "dancing" is a verb. Spacy only converted the verb to its base form. Shooters is a noun, converted to shooter. "Entered" is a verb, converted to enter.

In [137]:
# #tokenizer pattern and replace practice
# myvect1 = CountVectorizer(stop_words = 'english')
# myvect2 = CountVectorizer(stop_words = 'english')

# somelist = pd.Series(["Tinse is great ጥሩ ˈdäNG person", "Ethiopia is my country!", "15Brown fox", "Quick", "13 it is nice"])
# somelist2 = pd.Series(["Tinse is great ጥሩ",  "däNG person", "fox"])

# somelist = somelist.str.replace(r'(?u)[^\w\s]+|[\d]+', '')
# somelist2 = somelist2.str.replace(r'(?u)[^\w\s]+|[\d]+', '')
# #print(somelist)
# transformed1 = myvect1.fit(somelist)
# transformed2 = myvect2.fit(somelist2)

# #print(myvect.vocabulary_)
# myvect1.get_feature_names()
# myvect2.get_feature_names()
# the_vocabulary = set(sorted(myvect1.get_feature_names() + myvect2.get_feature_names()))
# myvectx = TfidfVectorizer(vocabulary=the_vocabulary)
# myvectx.vocabulary
# sparse.csr_matrix([1,2,3,0])

1. Simple TF-IDF Features

Inverse document frequency is calculated on the whole dataset. Doing train/test split after tfidf vectorization leaks data from the train set to the test set. I vectorized train and test sets separately using the same vocabulary.

Train/Test Split

Do train/test split stratified by cluster labels

In [135]:
# X = restaurants_undersampled.copy()
# y = restaurants_undersampled.cluster_labels_hdb

# # stratified sampling
# train, test, _, _ = train_test_split(X, y, test_size=0.2, random_state=20, shuffle=True, stratify=y)
# train = train.reset_index(drop=True)
# test = test.reset_index(drop=True)

# # sort them
# train_tfidf = train.sort_values(by="business_id").reset_index(drop=True)
# test_tfidf = test.sort_values(by="business_id").reset_index(drop=True)



# print(len(train_tfidf))
# print(len(test_tfidf))

# train_tfidf.to_csv("yelp-dataset/train_tfidf")
# test_tfidf.to_csv("yelp-dataset/test_tfidf")
In [138]:
train_tfidf = pd.read_csv("yelp-dataset/train_tfidf", index_col=0)
test_tfidf = pd.read_csv("yelp-dataset/test_tfidf", index_col=0)
In [139]:
train_tfidf.head(5)
Out[139]:
business_id name state latitude longitude stars review_count age is_open country cluster_labels_db cluster_labels_hdb
0 --1UhMGODdWsrMastO9DZw The Spicy Amigos AB 51.049673 -114.079977 4.0 24 2 1 Canada 0 354
1 --FBCX-N37CMYDfs790Bnw The Bar At Bermuda & St. Rose NV 35.978679 -115.155057 4.0 125 9 1 USA -1 21
2 --S62v0QgkqQaVUhFnNHrw Denny's OH 41.539420 -81.455225 2.0 35 7 1 USA 35 201
3 --cZ6Hhc9F7VkKXxHMVZSQ Pio Pio NC 35.199853 -80.844820 4.0 333 10 1 USA 10 297
4 --g-a85VwrdZJNf0R95GcQ Kabab House AZ 33.641298 -112.006392 4.5 24 5 0 USA 4 394
In [140]:
test_tfidf.head(5)
Out[140]:
business_id name state latitude longitude stars review_count age is_open country cluster_labels_db cluster_labels_hdb
0 --SrzpvFLwP_YFwB_Cetow Keung Kee Restaurant ON 43.806750 -79.288858 3.5 44 7 0 Canada 2 411
1 -2C96suwzrE_cqI1U69cLA Open Sesame AB 50.992148 -114.070811 3.5 66 10 1 Canada 0 246
2 -2hOglg7Lh8ZgclQJ9ba2w URSA ON 43.644875 -79.415864 3.5 62 6 0 Canada 2 598
3 -2wh_ZsD2n5xFYgzppcxNg Bonga Korean Restaurant AB 51.051028 -114.061981 4.0 26 3 1 Canada 0 383
4 -33BHZKxCiREa1cfa672OQ Triple Crown Bar & Grill ON 43.741384 -79.316749 2.0 15 7 0 Canada 2 416
In [141]:
# need a lot of memory
#reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)[["business_id", "text"]]
In [142]:
#reviews.head()

Train-set

In [143]:
# #Takes 2 hours. Saved models are loaded in the next cells
# import time
# from sklearn.feature_extraction.text import TfidfVectorizer
# start_time = time.time()
# from scipy import sparse
# tfidf_train_reviews_full = pd.merge(train_tfidf[["business_id"]], reviews[["business_id", "text"]], on=["business_id"])
# # save dense matrix to file
# tfidf_train_reviews_full.to_csv("yelp-dataset/tfidf_train_reviews_full.csv")
# # tokens containing numbers, digits and underscore are ignored
# tfidf_vectorizer_train_full = TfidfVectorizer(stop_words='english', min_df=0.001, tokenizer=lemmatizer_tokenize)  
# # remove punctuation and numbers
# tfidf_train_reviews_full.text = tfidf_train_reviews_full.text.str.replace(r'[^\w\s]+|[\d]+', '')
# tfidf_features_train_full = tfidf_vectorizer_train_full.fit_transform(tfidf_train_reviews_full.text)
# # save vectorizer
# pickle.dump(tfidf_vectorizer_train_full, open("yelp-dataset/tfidf_vectorizer_train_full.sav", 'wb'))

# # save sparse matrix to file
# sparse.save_npz("yelp-dataset/tfidf_features_train_full.npz", tfidf_features_train_full)

# print("shape of dense matrix: ", tfidf_train_reviews_full.shape)
# print("shape of sparse matrix:(tfidf) ", tfidf_features_train_full.shape)
# print(time.time() - start_time)

Load the Vectors

In [144]:
tfidf_features_train_full = sparse.load_npz("yelp-dataset/tfidf_features_train_full.npz")
tfidf_vectorizer_train_full = pickle.load(open("yelp-dataset/tfidf_vectorizer_train_full.sav", 'rb'))
tfidf_train_reviews_full = pd.read_csv("yelp-dataset/tfidf_train_reviews_full.csv", index_col=0)
# get the indices of reviews for each business id
reviews_indices_dict_tfidf_train_full = tfidf_train_reviews_full.business_id.groupby(tfidf_train_reviews_full.business_id).groups
In [145]:
tfidf_features_train_full.shape
Out[145]:
(939226, 3229)
In [147]:
# one way to ensure accuracy of data
assert(list(np.arange(0, 939226, 1)) == tfidf_train_reviews_full.index.tolist())

Average Words Per Business

In [149]:
# 3229 words
len(tfidf_vectorizer_train_full.get_feature_names())
Out[149]:
3229
In [150]:
import time
start_time = time.time()
num_of_sparse_features = len(tfidf_vectorizer_train_full.vocabulary_)
# k goes over the business ids
# i counts
the_sparse_matrices_tf_idf_train_full = []

for i, k in enumerate(reviews_indices_dict_tfidf_train_full.keys()):
    if( i % 5000 == 0):
        print("creating row: ", i)
    # convert to list: row indices for the reviews of k
    indices_for_sparse_matrix = list(reviews_indices_dict_tfidf_train_full[k])
    # find means column-wise
    means_sparse = tfidf_features_train_full[indices_for_sparse_matrix, :].mean(axis=0).tolist()[0]
    the_sparse_matrices_tf_idf_train_full.append(means_sparse)
pickle.dump(the_sparse_matrices_tf_idf_train_full, open("yelp-dataset/the_sparse_matrices_tf_idf_train_full.sav", "wb"))    
print(time.time() - start_time)
creating row:  0
creating row:  5000
creating row:  10000
creating row:  15000
creating row:  20000
21.358734369277954
In [151]:
# one way to check accuracy of data
print(all(tfidf_train_reviews_full.business_id.iloc[i] <= tfidf_train_reviews_full.business_id.iloc[i+1] for i in range(len(tfidf_train_reviews_full.business_id)-1)))
print(all(train_tfidf.business_id.iloc[i] <= train_tfidf.business_id.iloc[i+1] for i in range(len(train_tfidf.business_id)-1)))
True
True
In [152]:
# one way to check data consistency
assert(len(sorted(reviews_indices_dict_tfidf_train_full.keys())) == len(sorted(train_tfidf.business_id)))
# selected businesses in review_indices-dict may not same with train because train is startified and shuffled
In [153]:
set(reviews_indices_dict_tfidf_train_full.keys()) == set(train_tfidf.business_id)
Out[153]:
True

Prepare X and Y

In [154]:
tf_idf_aggregate_train = sparse.csr_matrix(the_sparse_matrices_tf_idf_train_full)
# merge sparse and dense features
dense_f = train_tfidf[['is_open', 'cluster_labels_hdb', 'stars', 'review_count', 'latitude', 'longitude', 'age']].astype(float)
tf_idf_train_final = hstack((dense_f, tf_idf_aggregate_train)).tocsr()
tf_idf_train_final
Out[154]:
<20028x3236 sparse matrix of type '<class 'numpy.float64'>'
	with 9958095 stored elements in Compressed Sparse Row format>
In [155]:
tf_idf_aggregate_train.shape
Out[155]:
(20028, 3229)
In [156]:
X_train_tf_idf = tf_idf_train_final[:,1:]
# convert it to is_closed
y_train_tf_idf = np.where(tf_idf_train_final[:,0].toarray(), 1, 0)

Test-set

In [157]:
# #Takes 1 hour. Saved models are loaded in the next cells
# import time
# from sklearn.feature_extraction.text import TfidfVectorizer
# start_time = time.time()
# from scipy import sparse
# tfidf_test_reviews_full = pd.merge(test_tfidf, reviews, on=["business_id"])
# # save dense matrix to file
# tfidf_test_reviews_full.to_csv("yelp-dataset/tfidf_test_reviews_full.csv")
# # tokens containing numbers, digits and underscore are ignored
# # remove punctuation and numbers
# tfidf_test_reviews_full.text = tfidf_test_reviews_full.text.str.replace(r'[^\w\s]+|[\d]+', '')
# # use train vectorizer
# tfidf_features_test_full = tfidf_vectorizer_train_full.transform(tfidf_test_reviews_full.text)

# # save sparse matrix to file
# sparse.save_npz("yelp-dataset/tfidf_features_test_full.npz", tfidf_features_test_full)

# print("shape of dense matrix: ", tfidf_test_reviews_full.shape)
# print("shape of sparse matrix:(tfidf) ", tfidf_features_test_full.shape)
# print(time.time() - start_time)

Load the Vectors

In [158]:
tfidf_features_test_full = sparse.load_npz("yelp-dataset/tfidf_features_test_full.npz")
tfidf_test_reviews_full = pd.read_csv("yelp-dataset/tfidf_test_reviews_full.csv", index_col=0)
# get the indices of reviews for each business id
reviews_indices_dict_tfidf_test_full = tfidf_test_reviews_full.business_id.groupby(tfidf_test_reviews_full.business_id).groups

Average words per Business

In [159]:
import time
start_time = time.time()
num_of_sparse_features = len(tfidf_vectorizer_train_full.vocabulary_)
# k goes over the business ids
# i counts
the_sparse_matrices_tf_idf_test_full = []

for i, k in enumerate(reviews_indices_dict_tfidf_test_full.keys()):
    if( i % 1000 == 0):
        print("creating row: ", i)
    # convert to list: row indices for the reviews of k
    indices_for_sparse_matrix = list(reviews_indices_dict_tfidf_test_full[k])
    # find means column-wise
    means_sparse = tfidf_features_test_full[indices_for_sparse_matrix, :].mean(axis=0).tolist()[0]
    the_sparse_matrices_tf_idf_test_full.append(means_sparse)
pickle.dump(the_sparse_matrices_tf_idf_test_full, open("yelp-dataset/the_sparse_matrices_tf_idf_test_full.sav", "wb"))    
print(time.time() - start_time)
creating row:  0
creating row:  1000
creating row:  2000
creating row:  3000
creating row:  4000
creating row:  5000
5.135698318481445
In [160]:
# test consistency
assert(len(sorted(reviews_indices_dict_tfidf_test_full.keys())) == len(sorted(test_tfidf.business_id)))
# selected businesses in review_indices-dict may not same with train because train is startified and shuffled
In [161]:
set(reviews_indices_dict_tfidf_test_full.keys()) == set(test_tfidf.business_id)
Out[161]:
True

Prepare X and y

In [162]:
tf_idf_aggregate_test = sparse.csr_matrix(the_sparse_matrices_tf_idf_test_full)
# merge sparse and dense features
dense_f = test_tfidf[['is_open', 'cluster_labels_hdb', 'stars', 'review_count', 'latitude', 'longitude', 'age']].astype(float)
tf_idf_test_final = hstack((dense_f, tf_idf_aggregate_test)).tocsr()
tf_idf_test_final
Out[162]:
<5008x3236 sparse matrix of type '<class 'numpy.float64'>'
	with 2546072 stored elements in Compressed Sparse Row format>
In [163]:
X_test_tf_idf = tf_idf_test_final[:,1:]
# convert it to is_closed
y_test_tf_idf = np.where(tf_idf_test_final[:,0].toarray(), 1, 0)
In [164]:
assert(len(test_tfidf) == len(reviews_indices_dict_tfidf_test_full.keys()))
In [165]:
tf_idf_aggregate_test.shape
Out[165]:
(5008, 3229)
In [166]:
test_tfidf.shape
Out[166]:
(5008, 12)
In [167]:
X_test_tf_idf.shape
Out[167]:
(5008, 3235)

2. Bag of Words

In [168]:
# #Takes 2 hours. Saved models are loaded in the next cells
# import time
# from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# start_time = time.time()
# from scipy import sparse
# restaurants_reviews_full = pd.merge(restaurants_undersampled, reviews, on=["business_id"])
# # save dense matrix to file
# restaurants_reviews_full.to_csv("yelp-dataset/restaurants_reviews_full.csv")
# # tokens containing numbers, digits and underscore are ignored
# restaurants_vectorizer_bow_full = CountVectorizer(stop_words='english', min_df=0.001, tokenizer=lemmatizer_tokenize) 
# # remove punctuation and numbers
# restaurants_reviews_full.text = restaurants_reviews_full.text.str.replace(r'[^\w\s]+|[\d]+', '')
# bow_features_full = restaurants_vectorizer_bow_full.fit_transform(restaurants_reviews_full.text)
# # save vectorizer, the vectorizer contains the vocabulary
# pickle.dump(restaurants_vectorizer_bow_full, open("yelp-dataset/restaurants_vectorizer_bow_full.sav", 'wb'))
# # save sparse matrix to file
# sparse.save_npz("yelp-dataset/bow_features_full.npz", bow_features_full)

# print("shape of dense matrix: ", restaurants_reviews_full.shape)
# print("shape of sparse matrix:(tfidf) ", bow_features_full.shape)
# print(time.time() - start_time)

Load the Vectors

In [169]:
bow_features_full = sparse.load_npz("yelp-dataset/bow_features_full.npz")
restaurants_vectorizer_bow_full = pickle.load(open("yelp-dataset/restaurants_vectorizer_bow_full.sav", 'rb'))
restaurants_reviews_full = pd.read_csv("yelp-dataset/restaurants_reviews_full.csv", index_col=0)


# get the indices of reviews for each business id
reviews_indices_dict_bow_full = restaurants_reviews_full.business_id.groupby(restaurants_reviews_full.business_id).groups

print(bow_features_full.shape)
print(restaurants_reviews_full.shape)
(1181427, 3229)
(1181427, 13)

Average Bag of Words Vectors Per Business

In [170]:
import time
#takes 10 minutes
start_time = time.time()
num_of_sparse_features = len(restaurants_vectorizer_bow_full.vocabulary_)
# k goes over the business ids
# i counts
the_sparse_matrices_bag_full = []

# closed restaurants
for i, k in enumerate(reviews_indices_dict_bow_full.keys()):
    if( i % 10000 == 0):
        print("creating row: ", i)
    # convert to list: row indices for the reviews of k
    indices_for_sparse_matrix = list(reviews_indices_dict_bow_full[k])
    # find means column-wise
    means_sparse = bow_features_full[indices_for_sparse_matrix, :].mean(axis=0).tolist()[0]
    the_sparse_matrices_bag_full.append(means_sparse)
# save aggregate sparse matrix
pickle.dump(the_sparse_matrices_bag_full, open("yelp-dataset/the_sparse_matrices_bag_full.sav", "wb"))    
print(time.time() - start_time)
creating row:  0
creating row:  10000
creating row:  20000
29.668766736984253

Prepare X and Y

In [171]:
restaurants_bow_aggregate = sparse.csr_matrix(the_sparse_matrices_bag_full)
restaurants_bow_aggregate
Out[171]:
<25036x3229 sparse matrix of type '<class 'numpy.float64'>'
	with 12341061 stored elements in Compressed Sparse Row format>
In [172]:
restaurants_undersampled.shape
Out[172]:
(25036, 12)
In [173]:
# merge sparse and dense features--include cluster labels
dense_f = restaurants_undersampled[['is_open', 'cluster_labels_hdb', 'stars', 'review_count', 'latitude', 'longitude', 'age']].astype(float)
restaurants_bow_final = hstack((dense_f, restaurants_bow_aggregate)).tocsr()
restaurants_bow_final
Out[173]:
<25036x3236 sparse matrix of type '<class 'numpy.float64'>'
	with 12503199 stored elements in Compressed Sparse Row format>
In [174]:
X = restaurants_bow_final
# use cluster label to stratify train_test_split
y = restaurants_bow_final[:, 1].toarray()

# stratified sampling
train_bow, test_bow, _, _ = train_test_split(X, y, test_size=0.2, random_state=20, shuffle=True, stratify=y)
In [175]:
train_bow.shape
Out[175]:
(20028, 3236)
In [176]:
test_bow.shape
Out[176]:
(5008, 3236)
In [177]:
# hdbclusterlabel, dense features, sparse features
# the cluster label will be used for cross validation and icf penalty
X_train_bow = train_bow[:, 1:]
# is_closed?
y_train_bow = np.where(train_bow[:, 0].toarray() == 0, 1, 0)

X_test_bow = test_bow[:, 1:]
y_test_bow = np.where(test_bow[:, 0].toarray() ==0, 1, 0)
In [178]:
X_train_bow.shape
Out[178]:
(20028, 3235)
In [179]:
X_test_bow.shape
Out[179]:
(5008, 3235)
In [180]:
X_train_bow[:,0].toarray()
Out[180]:
array([[450.],
       [402.],
       [ 31.],
       ...,
       [355.],
       [158.],
       [594.]])
In [181]:
count_classes = pd.value_counts(y_train_bow.flatten(), sort=True).sort_index()
count_classes.plot(kind = 'bar')
plt.title("Balanced Classes")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
print(count_classes)
0     9872
1    10156
dtype: int64
In [182]:
count_classes = pd.value_counts(y_test_bow.flatten(), sort=True).sort_index()
count_classes.plot(kind = 'bar')
plt.title("Balanced Classes")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
print(count_classes)
0    2490
1    2518
dtype: int64

3. Bag of Words + Inverse Cluster Frequency (icf)

In [183]:
np.where(X_train_bow[:, 0].toarray().flatten() == 3.0)[0]
Out[183]:
array([ 1286,  1407,  2530,  3541,  4463,  5128,  7271,  7453,  8554,
        9415,  9740,  9817, 11221, 11849, 12571, 13745, 13870, 14365,
       15886, 17118, 19615, 19953])
In [184]:
from sklearn.base import BaseEstimator, TransformerMixin

class ICFPenalty(BaseEstimator, TransformerMixin):
    def transform(self, X, *_):
        #ipdb.set_trace()
        self.X = X
        total_number_of_businesses = X.shape[0] 
        cluster_indices = set(self.X[:, 0].toarray().flatten())
        for c_index in cluster_indices:
            # indices of cluster members
            members_indices = np.where(X[:, 0].toarray().flatten() == c_index)[0]
            # other cluster memebers
            non_members_indices = np.where(X[:, 0].toarray().flatten() != c_index)[0]
            # textual features start from six
            # how many times a word occured in other clusters
            denominator = (1 + np.array(self.X[non_members_indices, 6:].sum(axis=0).tolist()[0]))
            idf = np.log10( total_number_of_businesses / denominator).reshape(1,-1)
            self.X[members_indices, 6:] = sparse.csr_matrix(self.X[members_indices, 6:].toarray() * idf)
        return self.X

   
    def fit_transform(self, X,  *_):
        return self.transform(X)

    def fit(self, *_):
        return self
In [185]:
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline, make_pipeline

pipeline = make_pipeline(ICFPenalty(), Normalizer())
#  the cluster label is on the to the front
X_train_bow_penalized = pipeline.fit_transform(X_train_bow)
# same target classes: y_train_bow and y_test_bow

X_test_bow_penalized = pipeline.transform(X_test_bow)
# same target classes: y_test_bow and y_test_bow

4. Word2Vec Similarities

In [186]:
# # Load Google's pre-trained Word2Vec model.
# import gensim
# from gensim.models import word2vec
# google_model = gensim.models.KeyedVectors.load_word2vec_format ('/home/tigial3535/google_word_vectors.bin.gz', binary=True)
In [187]:
# restaurants_vectorizer_bow2.get_feature_names()
In [188]:
# print(google_model.wv.most_similar(['order'], topn=30))
In [189]:
# google_model.wv.most_similar

Surrounding Restaurants

Within 1 mile circle, average ratings, average number of reviews

Training data supplied to this transformer will have stars, review_count, latitude, longitude and age. In a pipeline Distance Transformer is the last one. Bag of words features contain the target class. So the a pipeline step should handle it correctly

In [190]:
X_train_bow.shape
Out[190]:
(20028, 3235)

These are 3229 textual features, 1 cluster label, 5 dense features

In [191]:
# I can use one of the X_trains for testing surrounding restaurants
print("min latitude", X_train_bow[:,3].min())
print("max latitude", X_train_bow[:,3].max())
print("min longtiude", X_train_bow[:,4].min())
print("max longtiude", X_train_bow[:,4].max())
min latitude 33.2182
max latitude 51.301049335
min longtiude -115.3376537
max longtiude -73.4353725
In [192]:
X_train_sep = pd.DataFrame({'average_stars':X_train_bow[:,1].toarray().flatten(),
                            'review_count':X_train_bow[:,2].toarray().flatten(),
                            'latitude':X_train_bow[:,3].toarray().flatten(),
                            'longitude':X_train_bow[:,4].toarray().flatten()})


X_train_sep.head()
Out[192]:
average_stars review_count latitude longitude
0 4.5 3.0 43.689670 -79.295978
1 3.5 3.0 40.388520 -80.089117
2 3.5 20.0 40.137805 -88.259692
3 4.0 3.0 33.664516 -112.202635
4 4.0 13.0 33.801046 -112.130346
In [193]:
# A vectorized implemenation of geopy.distance.great_circle function
# Adopted by Tinsae 
# reference
# https://gist.github.com/rochacbruno/2883505
    
import math
# calculate the disance between a business and all other businesses
def distance_c(many_points, one_point):
    lat1 = many_points[:,0]
    lon1 = many_points[:,1]
    lat2 = one_point[0]
    lon2 = one_point[1]
    #radius = 6371 # km
    radius = 3959 # miles
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    a = np.sin(dlat/2) * np.sin(dlat/2) + np.cos(np.radians(lat1)) \
        * np.cos(np.radians(lat2)) * np.sin(dlon/2) * np.sin(dlon/2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = radius * c

    return d

# for testing purposes(calculating the distance between buisness@1 and others)
business5 = X_train_sep[['latitude', 'longitude']].values[5:6]
business8 = X_train_sep[['latitude', 'longitude']].values[8:9]
business18 = X_train_sep[['latitude', 'longitude']].values[18:19]
business1 = X_train_sep[['latitude', 'longitude']].values[1]

print("distance from 5: ", great_circle(business5, business1).miles)
print("distance from 8: ", great_circle(business8, business1).miles)
print("distance from 18: ", great_circle(business18, business1).miles)

distance_c(X_train_sep[['latitude', 'longitude']].values[0:20], business1)       
distance from 5:  358.36982240151036
distance from 8:  1916.471206338919
distance from 18:  1805.7737221144234
Out[193]:
array([ 231.70106918,    0.        ,  431.00664396, 1820.47110066,
       1812.83363887,  358.39141659, 1911.32759375, 1905.29382278,
       1916.58668658,  483.88230214,  373.82917463, 1818.7765574 ,
       1834.66829374, 1801.4768427 ,  232.25875604,  513.6565262 ,
        483.20706774,  228.93261952, 1805.88253209, 1811.22909367])

Distance Transformer

In [194]:
from sklearn.base import BaseEstimator, TransformerMixin

class DistanceTransformer(BaseEstimator, TransformerMixin):
    def transform(self, X, *_):
        self.nearest_average_rating=[]
        self.nearest_average_num_of_reviews =[]

        self.X = X
        self.X_sep = pd.DataFrame({'average_stars':X[:,0].toarray().flatten(),
                            'review_count':X[:,1].toarray().flatten(),
                            'latitude':X[:,2].toarray().flatten(),
                            'longitude':X[:,3].toarray().flatten()})
        
        self.X_sep.apply(lambda current: self.create_distance_features(current), axis=1)
        nearest_average_rating = np.array(self.nearest_average_rating).reshape(-1,1)
        nearest_average_num_of_reviews = np.array(self.nearest_average_num_of_reviews).reshape(-1,1)
        # add two featuers after the cluster labels
        return sparse.hstack((self.X[:, 0].astype(float), nearest_average_rating, nearest_average_num_of_reviews, self.X[:, 1:].astype(float)))
    
    # calculate the disance between a business and all other businesses
    def distance_c(self, many_points, one_point):
        # convert the values to float
        lat1 = many_points[:,0]
        lon1 = many_points[:,1]
        lat2 = one_point[0]
        lon2 = one_point[1]
        #radius = 6371 # km
        radius = 3959 # miles
        dlat = np.radians(lat2 - lat1)
        dlon = np.radians(lon2 - lon1)
        a = np.sin(dlat/2) * np.sin(dlat/2) + np.cos(np.radians(lat1)) \
            * np.cos(np.radians(lat2)) * np.sin(dlon/2) * np.sin(dlon/2)
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
        d = radius * c
        return d        
    
    def create_distance_features(self, current):
        # find the distance between current row and all other rows using the vectorized function
        the_distances = self.distance_c(self.X_sep[['latitude', 'longitude']].values, (current.latitude, current.longitude))    
        # A distance cannot be negative. 0 indicates it is the same place
        # so both shall be excluded
        indices_to_select = np.where((the_distances < 1) & (the_distances > 0))[0]
        # if a business within one mile is found
        if(len(indices_to_select) > 0):
            self.nearest_average_rating.append(self.X_sep.iloc[indices_to_select, :].average_stars.mean())
            self.nearest_average_num_of_reviews.append(self.X_sep.iloc[indices_to_select, :].review_count.mean())
        else:
            self.nearest_average_rating.append(np.nan)
            self.nearest_average_num_of_reviews.append(np.nan)    
   
    def fit_transform(self, X,  *_):
        return self.transform(X)

    
    def fit(self, *_):
        return self

Custom Normalizer

In [195]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Normalizer, StandardScaler

class CustomNormalizer(BaseEstimator, TransformerMixin):        

    def transform(self, X,  *_):
        self.X = X
        # column 5 and 6 are removed(latitude and longitude)
        # cluster labels are not scaled, standard scaler("nearest.., nearest.., stars, review_count,age ),
        # normalizer(text features)
        return sparse.hstack((self.X[:, 0], StandardScaler().fit_transform(self.X[:, [1,2,3,4,7]].toarray()),  
                              Normalizer().fit_transform(self.X[:, 8:]) )).tocsr()
    
    def fit_transform(self, X,  *_):
        return self.transform(X)
    
    def fit(self, *_):
        return self

Latent Semantic Analysis

In [371]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD

class LSA(BaseEstimator, TransformerMixin):        
    def __init__(self, dim):
        self.dim = dim
    
    def fit(self, X,  *_):
        self.X = X.tocsr()
        self.svd = TruncatedSVD(self.dim)
        lsa = make_pipeline(self.svd, Normalizer(copy=False))
        lsa.fit(self.X[:, 6:])
        variance_explained = self.svd.explained_variance_ratio_
        total_variance = variance_explained.sum()
        print("Percent variance captured by all components:",total_variance * 100)
        return self
    def transform(self, X,  *_):
        # reduce the textual features only
        return sparse.hstack((X[:, 0:6], self.svd.transform(X[:, 6:]))).tocsr()
    
    def fit_transform(self, X,  *_):
        return self.fit(X).transform(X)  
In [197]:
# # test code
# sc = StandardScaler()
# sc.fit_transform(X_train[:,2].toarray())
# X_train[0:5, 0:5].toarray()
# X_train_transformed.tocsr()[0:5, 0:5].toarray()
# CNR = CustomNormalizer()
# X_train_transformed = CNR.transform(X_train)
# (X_train_transformed).tocsr()[0:5, 0:5].toarray()
In [198]:
# this is used for testing the tranformer

# DFC = DistanceFeaturesCreator()
# X_train_transformed = DFC.transform(X_train)
# (X_train_transformed).tocsr()[0:5, 0:5].toarray()


# import time

# nearest_average_rating = []
# nearest_average_num_of_reviews =  []
# start_time = time.time()

# X_train_sep2 = X_train_sep.copy(deep=True)

# def create_distance_features(current):
#     # find the distance between current row and all other rows using the vectorized function
#     the_distances = distance_c(X_train_sep2[['latitude', 'longitude']].values, (current.latitude, current.longitude))    
#     # A distance cannot be negative. 0 indicates it is the same place
#     # so both shall be excluded
#     indices_to_select = np.where((the_distances < 1) & (the_distances > 0))[0]
#     # if a business within one mile is found
#     if(len(indices_to_select) > 0):
#         nearest_average_rating.append(X_train_sep2.iloc[indices_to_select, :].average_stars.mean())
#         nearest_average_num_of_reviews.append(X_train_sep2.iloc[indices_to_select, :].review_count.mean())
#     else:
#         nearest_average_rating.append("No-Nearest")
#         nearest_average_num_of_reviews.append("No-Nearest")

        
# X_train_sep2.apply(lambda x: create_distance_features(x), axis=1)
# print(time.time() - start_time, " seconds")


# X_train_sep2["nearest_average_rating"] = nearest_average_rating
# X_train_sep2["nearest_average_num_of_reviews"] = nearest_average_num_of_reviews
# X_train_sep2.head()
In [199]:
# # This is for testing

# X_train_transformed
# RNN = RemoveNoNearest()

# RNN = RemoveNoNearest()
# X_transformed_again, y_transformed = RNN.transform(X_train_transformed.tocsr(), y_train)

# X_transformed_again = RNN.transform(X_train_transformed.tocsr())
# X_train_transformed
# X_transformed_again
# X_train_transformed.tocsr()[312, 0:2].toarray()
# np.isnan(X_transformed_again.toarray()).sum()
In [200]:
# from sklearn.preprocessing import StandardScaler
# with warnings.catch_warnings():
#     warnings.simplefilter("ignore")
#     scaler = StandardScaler()
#     dense_train_sel_sorted[dense_train_sel_sorted.columns.drop("is_closed")] = scaler.fit_transform(dense_train_sel_sorted[dense_train_sel_sorted.columns.drop("is_closed")])
# dense_train_sel_sorted.head(5)
In [201]:
from sklearn.impute import SimpleImputer
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
X_train_bow_fplot = pipeline.fit_transform(X_train_bow)
In [202]:
# create dataframe for plotting purpose
column_headers = ['nearest_average_rating', 'nearest_averge_number_of_reviews','average_stars', 
                  'review_count', 'age_of_restaurant']
df = pd.DataFrame(X_train_bow_fplot.tocsr()[:, 1:6 ].toarray(), columns=column_headers)
df.head()
Out[202]:
nearest_average_rating nearest_averge_number_of_reviews average_stars review_count age_of_restaurant
0 -0.072763 -0.018319 1.387474 -0.588410 0.228023
1 0.391776 -1.322321 0.115056 -0.588410 -0.095729
2 0.000000 0.000000 0.115056 -0.360688 1.523031
3 -0.048877 0.296440 0.751265 -0.588410 -1.714489
4 0.000000 0.000000 0.751265 -0.454456 0.228023
In [203]:
# Code from https://seaborn.pydata.org/examples/many_pairwise_correlations.html
sns.set(style="white")

# Compute the correlation matrix
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
sns.despine()
In [205]:
sns.pairplot(df)
sns.despine()

2. Baseline Modeling

I am using undersampled data to test logistic regression and decsion tree models. Other estimators don't perform good. The pipelines transform train and test set separately. I tried to use three forms of textual features.

  • TF-IDF
  • Bag of Words
  • Bag of Words + Inverse Cluster Frequency (icf)
In [206]:
from sklearn.metrics import classification_report, confusion_matrix,roc_auc_score,average_precision_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgb 
In [207]:
# these are the first 5 features(1,2,3,4,5), 0 is cluster label
first_5 = ["nearest_average_rating", "nearest_averge_number_of_reviews", "average_stars", "review_count", "age_of_restaurant"] 
first_5
Out[207]:
['nearest_average_rating',
 'nearest_averge_number_of_reviews',
 'average_stars',
 'review_count',
 'age_of_restaurant']
In [208]:
print("total words", len(restaurants_vectorizer_bow_full.get_feature_names()) )
# these are the remaining for bag of words
restaurants_vectorizer_bow_full.get_feature_names()[:10]
total words 3229
Out[208]:
['ability',
 'able',
 'absolute',
 'absolutely',
 'ac',
 'accent',
 'accept',
 'acceptable',
 'access',
 'accessible']
In [209]:
print("total words", len(tfidf_vectorizer_train_full.get_feature_names()) )
# these are the remaining for bag of words
tfidf_vectorizer_train_full.get_feature_names()[:10]
total words 3229
Out[209]:
['ability',
 'able',
 'absolute',
 'absolutely',
 'ac',
 'accent',
 'accept',
 'acceptable',
 'access',
 'accessible']
In [210]:
# combine the features
all_the_features_bow =  first_5 + list(restaurants_vectorizer_bow_full.get_feature_names())
len(all_the_features_bow)
Out[210]:
3234
In [211]:
# combine the features
all_the_features_tfidf =  first_5 + list(tfidf_vectorizer_train_full.get_feature_names())
len(all_the_features_tfidf)
Out[211]:
3234
In [282]:
def estimate(X_train, y_train, X_test, y_test, model_type, penalty_arg='l2', C_arg=1.0,
            max_depth_arg=50, n_estimators_arg=900,  min_samples_leaf_arg=15):
    if(model_type == "logistic_regression"):
        model = LogisticRegression(penalty=penalty_arg, C=C_arg)
    elif(model_type == "decision_tree"):
        model = DecisionTreeClassifier(max_depth=10)
    elif(model_type == "random_forest"):
        model = RandomForestClassifier(max_depth=max_depth_arg, n_estimators=n_estimators_arg, min_samples_leaf=min_samples_leaf_arg, n_jobs=-1)
    elif(model_type == "XGBoost"):
        model = XGBClassifier()
        
    elif(model_type == "lightgbm"): 
        lgb_train_data = lgb.Dataset(X_train, label=y_train)
        params = {}
        params['learning_rate'] = 0.003
        params['boosting_type'] = 'gbdt'
        params['objective'] = 'binary'
        params['metric'] = 'binary_error'
        params['sub_feature'] = 0.5
        params['num_leaves'] = 150
        params['min_data'] = 50
        params['max_depth'] = 30

        #training our model using light gbm
        num_round=1000
        model = lgb.train(params, lgb_train_data, num_round)

       
    if(model_type == "lightgbm"):
        y_pred = model.predict(X_test)
        #converting probabilities into 0 or 1
        for i in range(0, len(y_test)):
            if y_pred[i] >= .5:
                y_pred [i] = 1
            else:  
                y_pred [i] = 0
        
        y_pred_train = model.predict(X_train)
        #converting probabilities into 0 or 1
        for i in range(0,len(y_train)):
            if y_pred_train[i] >= .5:
                y_pred_train [i] = 1
            else:  
                y_pred_train [i] = 0
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_train = model.predict(X_train)
        
    print('\n Confusion Matrix')
    print(pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted']))
    print(classification_report(y_test, y_pred))
    print("ROC-AUC", roc_auc_score(y_test, y_pred))
    print("PR-AUC-Score", average_precision_score(y_test, y_pred))
    print("Test Accuracy: ", accuracy_score(y_test, y_pred))
    print("Train Accuracy: ", accuracy_score(y_train, y_pred_train))
    return model
In [284]:
def analyze_features(feature_names, coeffs=None, model_type="logistic", limit=10):
    if(model_type == "logistic"):
        print("Most positive cofficeints\n(contribute positively to restaurants closure)")
        print()
        sorted_features = np.argsort(coeffs).ravel()[::-1]
        for i in range(limit):
            print(feature_names[sorted_features[i]], end=",")
        print()
        print("\nMost negative cofficeints\n(contribute negatively to restaurants closure)")
        print()
        sorted_features = np.argsort(coeffs).ravel()
        for i in range(limit):
            print(feature_names[sorted_features[i]],end=",")
    elif(model_type=="lightgbm"):
        sorted_features = np.argsort(model.feature_importance())[::-1]
        for i in range(limit):
            print(feature_names[sorted_features[i]], end=",")
            
        feature_importance = model.feature_importance()
        # Make importances relative to max importance.
        feature_importance = 100.0 * (feature_importance / feature_importance.max())
        # argsort sorts from smallest to largest
        sorted_idx = np.argsort(feature_importance)[-30:]
        pos = np.arange(sorted_idx.shape[0]) + .9
        # # use only one column of a 1 x 2 figure
        plt.subplot(1, 2, 2)
        # pos is the postion of the bars on the y axis
        # feature_importance[sorted_idx] is the width of the bars
        plt.barh(pos, feature_importance[sorted_idx], align='center')
        plt.yticks(pos, np.array(feature_names)[[sorted_idx]])
        plt.xlabel('Relative Importance')
        plt.title('Variable Importance')
        plt.show()
          
    else:
        sorted_features = np.argsort(model.feature_importances_)[::-1]
        for i in range(limit):
            print(feature_names[sorted_features[i]], end=",")
        # plot
        
        feature_importance = model.feature_importances_
        # Make importances relative to max importance.
        feature_importance = 100.0 * (feature_importance / feature_importance.max())
        # argsort sorts from smallest to largest
        sorted_idx = np.argsort(feature_importance)[-30:]
        pos = np.arange(sorted_idx.shape[0]) + .9
        # # use only one column of a 1 x 2 figure
        plt.subplot(1, 2, 2)
        # pos is the postion of the bars on the y axis
        # feature_importance[sorted_idx] is the width of the bars
        plt.barh(pos, feature_importance[sorted_idx], align='center')
        plt.yticks(pos, np.array(feature_names)[[sorted_idx]])
        plt.xlabel('Relative Importance')
        plt.title('Variable Importance')
        plt.show()   
        

A. Simple TF-IDF Model

The preprocessing is done in the previous section in this way. I split the train and test sets before a and converted each of them to tfidf vectors. The same vocabulary is used for both. Doing a train-test split after tfidf vectorization creates a weak model because test-set data will leak to the train-set.

In [213]:
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
In [214]:
X_train_tf_idf_t = pipeline.fit_transform(X_train_tf_idf)
In [215]:
X_test_tf_idf_t = pipeline.fit_transform(X_test_tf_idf)
In [216]:
y_train_tf_idf = y_train_tf_idf.ravel()
y_test_tf_idf = y_test_tf_idf.ravel()
In [217]:
# this cluster labels are untouched by the transformations
X_train_tf_idf_t[:, 0].toarray()
Out[217]:
array([[354.],
       [ 21.],
       [201.],
       ...,
       [234.],
       [632.],
       [379.]])
In [223]:
# logistic regression, ignore cluster labels
model = estimate(X_train_tf_idf_t[:, 1:], y_train_tf_idf, X_test_tf_idf_t[:,1:], 
                 y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=1.0)

analyze_features(all_the_features_tfidf, model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1921   586
1           523  1978
              precision    recall  f1-score   support

           0       0.79      0.77      0.78      2507
           1       0.77      0.79      0.78      2501

   micro avg       0.78      0.78      0.78      5008
   macro avg       0.78      0.78      0.78      5008
weighted avg       0.78      0.78      0.78      5008

ROC-AUC 0.7785690669882825
PR-AUC-Score 0.714560775077906
Test Accuracy:  0.7785543130990416
Train Accuracy:  0.8075694028360295
Most positive cofficeints
(contribute positively to restaurants closure)

year,rude,busy,calgary,favorite,favourite,dive,store,flavour,line,location,recommend,usually,goto,fast,

Most negative cofficeints
(contribute negatively to restaurants closure)

close,vegas,groupon,owner,think,new,concept,business,hope,open,paradise,menu,sign,chef,yelp,
In [273]:
# decision tree
model = estimate(X_train_tf_idf_t[:, 1:], y_train_tf_idf, X_test_tf_idf_t[:,1:], 
                 y_test_tf_idf, model_type="decision_tree")
analyze_features(all_the_features_tfidf, model_type="other")
 Confusion Matrix
Predicted     0     1
True                 
0          1717   790
1           817  1684
              precision    recall  f1-score   support

           0       0.68      0.68      0.68      2507
           1       0.68      0.67      0.68      2501

   micro avg       0.68      0.68      0.68      5008
   macro avg       0.68      0.68      0.68      5008
weighted avg       0.68      0.68      0.68      5008

ROC-AUC 0.6791064986051849
PR-AUC-Score 0.621461065130615
Test Accuracy:  0.6791134185303515
Train Accuracy:  0.80507289794288
age_of_restaurant,rude,fast,review_count,think,location,close,customer,owner,groupon,

B. Bag of Words Model

In the preprocessing step, the bag of words vectors were created for all the data. And then the data is split into train and test set. Unlike the tfidf transformation, bag of words vectorization don't leak data. Every business has its own reviews and words are counted based on those reviews. Coupus-wide information like inverse document frequency is not used.

In [253]:
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
In [254]:
X_train_bow_t = pipeline.fit_transform(X_train_bow)
X_test_bow_t = pipeline.fit_transform(X_test_bow)
In [255]:
y_train_bow = y_train_bow.ravel()
y_test_bow = y_test_bow.ravel()
In [258]:
X_train_bow[:, 6].toarray()
Out[258]:
array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])
In [274]:
# logistic regression, cluster labels are not used for estimation
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:], 
                 y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

analyze_features(all_the_features_bow, model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1468  1022
1           801  1717
              precision    recall  f1-score   support

           0       0.65      0.59      0.62      2490
           1       0.63      0.68      0.65      2518

   micro avg       0.64      0.64      0.64      5008
   macro avg       0.64      0.64      0.64      5008
weighted avg       0.64      0.64      0.64      5008

ROC-AUC 0.6357243110647515
PR-AUC-Score 0.5874014820280926
Test Accuracy:  0.6359824281150159
Train Accuracy:  0.6433493109646495
Most positive cofficeints
(contribute positively to restaurants closure)

age_of_restaurant,good,food,place,average_stars,order,nearest_average_rating,come,like,great,service,egg,think,breakfast,bar,

Most negative cofficeints
(contribute negatively to restaurants closure)

review_count,nearest_averge_number_of_reviews,taco,gyro,pho,pie,indian,chicken,philly,long,italian,cuban,fast,cook,grill,
In [275]:
# decision tree, cluster labels are not used as features
model = estimate(X_train_bow_t[:,1:], y_train_bow, X_test_bow_t[:,1:], 
                 y_test_bow, model_type="decision_tree")
analyze_features(all_the_features_bow, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1698   792
1          1026  1492
              precision    recall  f1-score   support

           0       0.62      0.68      0.65      2490
           1       0.65      0.59      0.62      2518

   micro avg       0.64      0.64      0.64      5008
   macro avg       0.64      0.64      0.64      5008
weighted avg       0.64      0.64      0.64      5008

ROC-AUC 0.6372307338966668
PR-AUC-Score 0.5919389143543411
Test Accuracy:  0.6369808306709265
Train Accuracy:  0.7446574795286599
age_of_restaurant,review_count,nearest_average_rating,good,average_stars,food,different,order,orange,soft,certain,occasion,like,wait,visit,

C. Bag of Words with ICF Penalty

In [233]:
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
In [234]:
X_train_bow_penalized_t = pipeline.fit_transform(X_train_bow_penalized)
X_test_bow_penalized_t = pipeline.fit_transform(X_test_bow_penalized)
In [276]:
# logistic regression
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:], 
                 y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

analyze_features(all_the_features_bow, model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1339  1151
1           627  1891
              precision    recall  f1-score   support

           0       0.68      0.54      0.60      2490
           1       0.62      0.75      0.68      2518

   micro avg       0.64      0.64      0.64      5008
   macro avg       0.65      0.64      0.64      5008
weighted avg       0.65      0.64      0.64      5008

ROC-AUC 0.6443719277427422
PR-AUC-Score 0.5920397469571629
Test Accuracy:  0.6449680511182109
Train Accuracy:  0.6533852606351108
Most positive cofficeints
(contribute positively to restaurants closure)

age_of_restaurant,average_stars,egg,breakfast,bar,think,brisket,good,orange,greek,offer,express,new,slider,ingredient,

Most negative cofficeints
(contribute negatively to restaurants closure)

nearest_averge_number_of_reviews,review_count,nearest_average_rating,chicken,pho,taco,pie,gyro,dog,philly,indian,cuban,veal,italian,mexican,
In [277]:
# decision tree
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:], 
                 y_test_bow, model_type="decision_tree")
analyze_features(all_the_features_bow, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1747   743
1          1035  1483
              precision    recall  f1-score   support

           0       0.63      0.70      0.66      2490
           1       0.67      0.59      0.63      2518

   micro avg       0.64      0.64      0.64      5008
   macro avg       0.65      0.65      0.64      5008
weighted avg       0.65      0.64      0.64      5008

ROC-AUC 0.6452829586814295
PR-AUC-Score 0.5990444081983031
Test Accuracy:  0.6449680511182109
Train Accuracy:  0.7539444777311763
age_of_restaurant,review_count,nearest_average_rating,nearest_averge_number_of_reviews,average_stars,try,ask,available,tuesday,stick,poorly,clean,free,locate,stuff,

In a prior work the maximum accuracy was obtained by logistic regression which is around 72%. In this project, I applied a spacy lemmatizer for all models. Still, Logistic regression surpasses the others. It achieved around 75% accuracy in the baseline model. Hopefully, the model may improve after hyperparameter tuning

3. Ensembles

The baseline models performed better with the tfidf features. The bag of words features are not bad. I choose the bag of words features because its results are interpretable and doing gridsearchcv is possible without data leakage. Next I will try to use ensemble models

A. Bag of Words Model

In [278]:
# random forest
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:], 
                 y_test_bow, model_type="random_forest")

analyze_features(all_the_features_bow, model_type="other")
 Confusion Matrix
Predicted     0     1
True                 
0          1632   858
1           907  1611
              precision    recall  f1-score   support

           0       0.64      0.66      0.65      2490
           1       0.65      0.64      0.65      2518

   micro avg       0.65      0.65      0.65      5008
   macro avg       0.65      0.65      0.65      5008
weighted avg       0.65      0.65      0.65      5008

ROC-AUC 0.6476075868206743
PR-AUC-Score 0.5985696434019193
Test Accuracy:  0.6475638977635783
Train Accuracy:  0.9758837627321749
age_of_restaurant,review_count,nearest_average_rating,nearest_averge_number_of_reviews,good,food,place,like,average_stars,order,
In [279]:
# xgboost
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:], 
#                  y_test_bow, model_type="XGBoost")

analyze_features(all_the_features_bow, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1758   732
1           935  1583
              precision    recall  f1-score   support

           0       0.65      0.71      0.68      2490
           1       0.68      0.63      0.66      2518

   micro avg       0.67      0.67      0.67      5008
   macro avg       0.67      0.67      0.67      5008
weighted avg       0.67      0.67      0.67      5008

ROC-AUC 0.6673488234111985
PR-AUC-Score 0.6165890664397385
Test Accuracy:  0.667132587859425
Train Accuracy:  0.7066606750549231
age_of_restaurant,review_count,average_stars,good,nearest_average_rating,nearest_averge_number_of_reviews,additionally,likely,food,teach,butter,refill,crispy,weekday,order,
In [280]:
# lightgbm
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:], 
                 y_test_bow, model_type="lightgbm")
analyze_features(all_the_features_bow, model_type="lightgbm", limit=20)
 Confusion Matrix
Predicted   0.0   1.0
True                 
0          1674   816
1           861  1657
              precision    recall  f1-score   support

           0       0.66      0.67      0.67      2490
           1       0.67      0.66      0.66      2518

   micro avg       0.67      0.67      0.67      5008
   macro avg       0.67      0.67      0.67      5008
weighted avg       0.67      0.67      0.67      5008

ROC-AUC 0.6651755552790989
PR-AUC-Score 0.6128503781402554
Test Accuracy:  0.6651357827476039
Train Accuracy:  0.9338925504293989
age_of_restaurant,review_count,nearest_average_rating,nearest_averge_number_of_reviews,average_stars,good,food,place,order,think,long,little,price,like,time,fresh,friendly,wait,s,staff,
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-280-31ae16ac4e4a> in <module>
      2 model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:], 
      3                  y_test_bow, model_type="lightgbm")
----> 4 analyze_features(all_the_features_bow, model_type="lightgbm", limit=20)

<ipython-input-272-ef58057a6513> in analyze_features(feature_names, coeffs, model_type, limit)
     28         # feature_importance[sorted_idx] is the width of the bars
     29         plt.barh(pos, feature_importance[sorted_idx], align='center')
---> 30         plt.yticks(pos, np.array(all_the_features)[[sorted_idx]])
     31         plt.xlabel('Relative Importance')
     32         plt.title('Variable Importance')

NameError: name 'all_the_features' is not defined
In [285]:
analyze_features(all_the_features_bow, model_type="lightgbm", limit=20)
age_of_restaurant,review_count,nearest_average_rating,nearest_averge_number_of_reviews,average_stars,good,food,place,order,think,long,little,price,like,time,fresh,friendly,wait,s,staff,

C. Bag of Words with ICF Penalty

In [286]:
# random forest
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:], 
                 y_test_bow, model_type="random_forest")

analyze_features(all_the_features_bow, model_type="other")
 Confusion Matrix
Predicted     0     1
True                 
0          1602   888
1           920  1598
              precision    recall  f1-score   support

           0       0.64      0.64      0.64      2490
           1       0.64      0.63      0.64      2518

   micro avg       0.64      0.64      0.64      5008
   macro avg       0.64      0.64      0.64      5008
weighted avg       0.64      0.64      0.64      5008

ROC-AUC 0.6390020766146396
PR-AUC-Score 0.5916464538301363
Test Accuracy:  0.6389776357827476
Train Accuracy:  0.9578589974036349
review_count,age_of_restaurant,nearest_average_rating,nearest_averge_number_of_reviews,average_stars,good,food,place,order,like,
In [287]:
# lightgbm

# # lightgbm
# model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:], 
#                  y_test_bow, model_type="lightgbm")
# analyze_features(all_the_features_bow, model_type="lightgbm", limit=20)
In [288]:
# xgboost
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:], 
                 y_test_bow, model_type="XGBoost")
analyze_features(all_the_features_bow, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1680   810
1           828  1690
              precision    recall  f1-score   support

           0       0.67      0.67      0.67      2490
           1       0.68      0.67      0.67      2518

   micro avg       0.67      0.67      0.67      5008
   macro avg       0.67      0.67      0.67      5008
weighted avg       0.67      0.67      0.67      5008

ROC-AUC 0.6729331942543804
PR-AUC-Score 0.6190447563485397
Test Accuracy:  0.672923322683706
Train Accuracy:  0.7096065508288396
age_of_restaurant,nearest_average_rating,review_count,nearest_averge_number_of_reviews,average_stars,time,like,additionally,dinein,giant,hesitant,lamp,lead,rack,meet,

4. Parameter Tuning and Testing

Best Parameters for Logistic Regression

In [289]:
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer(), LogisticRegression())
In [290]:
# with random split
from sklearn.model_selection import ParameterGrid, GridSearchCV
param_grid = {'logisticregression__C': [0.1, 0.2, 0.25, 0.5, 1.0, 1.25,1.5], 
                   'logisticregression__penalty' :['l2']
                  }
logistic_search = GridSearchCV(pipeline, param_grid, iid=False, cv=3,
                      return_train_score=True, n_jobs=-1)
logistic_search.fit(X_train_bow[:, 1:], y_train_bow)
print("Best parameter (CV score=%0.3f):" % logistic_search.best_score_)
print(logistic_search.best_params_)
/home/tigial3535/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/process_executor.py:706: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

/home/tigial3535/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/process_executor.py:706: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

/home/tigial3535/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/process_executor.py:706: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

/home/tigial3535/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/process_executor.py:706: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

Best parameter (CV score=0.628):
{'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2'}
In [291]:
logistic_search.cv_results_
Out[291]:
{'mean_fit_time': array([71.31044475, 72.05284842, 70.752496  , 72.53849506, 74.39594062,
        74.57795684, 49.42227976]),
 'std_fit_time': array([0.80730862, 0.6691186 , 0.45659213, 1.22568859, 0.14946809,
        0.06167883, 0.74894081]),
 'mean_score_time': array([29.07236258, 29.21293894, 29.96603266, 30.17933154, 29.1438605 ,
        28.99530522, 18.32677666]),
 'std_score_time': array([0.38013835, 0.29198393, 0.25285803, 0.15410511, 0.18112025,
        0.15008002, 0.10077868]),
 'param_logisticregression__C': masked_array(data=[0.1, 0.2, 0.25, 0.5, 1.0, 1.25, 1.5],
              mask=[False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_logisticregression__penalty': masked_array(data=['l2', 'l2', 'l2', 'l2', 'l2', 'l2', 'l2'],
              mask=[False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'logisticregression__C': 0.1,
   'logisticregression__penalty': 'l2'},
  {'logisticregression__C': 0.2, 'logisticregression__penalty': 'l2'},
  {'logisticregression__C': 0.25, 'logisticregression__penalty': 'l2'},
  {'logisticregression__C': 0.5, 'logisticregression__penalty': 'l2'},
  {'logisticregression__C': 1.0, 'logisticregression__penalty': 'l2'},
  {'logisticregression__C': 1.25, 'logisticregression__penalty': 'l2'},
  {'logisticregression__C': 1.5, 'logisticregression__penalty': 'l2'}],
 'split0_test_score': array([0.62917478, 0.62513105, 0.62408267, 0.61299985, 0.60551146,
        0.60356448, 0.59622585]),
 'split1_test_score': array([0.62417615, 0.62297783, 0.62028161, 0.61339125, 0.60080887,
        0.59781306, 0.59631516]),
 'split2_test_score': array([0.63071161, 0.62981273, 0.62846442, 0.61842697, 0.61003745,
        0.60674157, 0.60374532]),
 'mean_test_score': array([0.62802085, 0.62597387, 0.62427623, 0.61493936, 0.60545259,
        0.60270637, 0.59876211]),
 'std_test_score': array([0.00279007, 0.00285327, 0.00334342, 0.00247128, 0.00376778,
        0.00369521, 0.00352385]),
 'rank_test_score': array([1, 2, 3, 4, 5, 6, 7], dtype=int32),
 'split0_train_score': array([0.65980076, 0.67230919, 0.67567972, 0.69058497, 0.70856116,
        0.71230619, 0.71372931]),
 'split1_train_score': array([0.65975135, 0.67158478, 0.67622828, 0.69120731, 0.70408928,
        0.7083583 , 0.71068005]),
 'split2_train_score': array([0.66329664, 0.67325695, 0.67797499, 0.69317756, 0.70942859,
        0.71377219, 0.717966  ]),
 'mean_train_score': array([0.66094958, 0.67238364, 0.67662766, 0.69165662, 0.70735967,
        0.71147889, 0.71412512]),
 'std_train_score': array([0.00165974, 0.00068469, 0.00097867, 0.00110507, 0.00233948,
        0.00228632, 0.00298762])}

5. Separating Bag of Words and Dense Features

The bag of words contain more than 3000 features. Let me try to see their performance alone.

A. Only Bag of Words

In [302]:
X_train_bow_t[:,6].toarray()
Out[302]:
array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])
In [307]:
model = estimate(X_train_bow_t[:, 6:], y_train_bow, X_test_bow_t[:,6:], 
                 y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

analyze_features(all_the_features_bow[5:], model.coef_, limit=15)
 Confusion Matrix
Predicted    0     1
True                
0          660  1830
1          583  1935
              precision    recall  f1-score   support

           0       0.53      0.27      0.35      2490
           1       0.51      0.77      0.62      2518

   micro avg       0.52      0.52      0.52      5008
   macro avg       0.52      0.52      0.48      5008
weighted avg       0.52      0.52      0.49      5008

ROC-AUC 0.5167636391475353
PR-AUC-Score 0.5113629325041366
Test Accuracy:  0.5181709265175719
Train Accuracy:  0.5638605951667666
Most positive cofficeints
(contribute positively to restaurants closure)

good,food,place,order,great,breakfast,come,like,bar,egg,service,pizza,little,think,brisket,

Most negative cofficeints
(contribute negatively to restaurants closure)

taco,pho,mexican,chicken,gyro,long,philly,pie,cook,rice,grill,italian,cuban,indian,dog,

XGBoost

In [308]:
# xgboost
model = estimate(X_train_bow_t[:,6:], y_train_bow, X_test_bow_t[:,6:], 
                  y_test_bow, model_type="XGBoost")

analyze_features(all_the_features_bow, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1161  1329
1          1040  1478
              precision    recall  f1-score   support

           0       0.53      0.47      0.49      2490
           1       0.53      0.59      0.56      2518

   micro avg       0.53      0.53      0.53      5008
   macro avg       0.53      0.53      0.53      5008
weighted avg       0.53      0.53      0.53      5008

ROC-AUC 0.5266194244810856
PR-AUC-Score 0.516733374568454
Test Accuracy:  0.5269568690095847
Train Accuracy:  0.6737567405632116
goat,fold,convenience,coast,extensive,hot,creek,honestly,oz,college,aioli,lemonade,flame,gentleman,housemade,

The words are similar to the onces obtained before. More research is required to know why these words are good predictors.

Applying LSA

Let us see if the bow model can be improved by applying Latent Sematic Analyisis. The LSA class is able to fit and transform LSA on the training set and use it to transform the test set

In [374]:
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer(), LSA(dim=500))
In [375]:
X_train_bow_lsa_t = pipeline.fit_transform(X_train_bow)
X_test_bow_lsa_t = pipeline.transform(X_test_bow)
Percent variance captured by all components: 65.76572986103656
In [377]:
X_train_bow_lsa_t.shape
Out[377]:
(20028, 506)
In [378]:
X_test_bow_lsa_t
Out[378]:
<5008x506 sparse matrix of type '<class 'numpy.float64'>'
	with 2534041 stored elements in Compressed Sparse Row format>
In [379]:
# logistic regression
model = estimate(X_train_bow_lsa_t.tocsr()[:,6:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,6:], 
                 y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

analyze_features(["svd_feature_"+ str(i) for i in range(500)], model.coef_, limit=15)
 Confusion Matrix
Predicted    0     1
True                
0          659  1831
1          579  1939
              precision    recall  f1-score   support

           0       0.53      0.26      0.35      2490
           1       0.51      0.77      0.62      2518

   micro avg       0.52      0.52      0.52      5008
   macro avg       0.52      0.52      0.49      5008
weighted avg       0.52      0.52      0.49      5008

ROC-AUC 0.51735711711022
PR-AUC-Score 0.5116727899224397
Test Accuracy:  0.5187699680511182
Train Accuracy:  0.5479828240463351
Most positive cofficeints
(contribute positively to restaurants closure)

svd_feature_27,svd_feature_0,svd_feature_15,svd_feature_28,svd_feature_25,svd_feature_10,svd_feature_26,svd_feature_29,svd_feature_24,svd_feature_60,svd_feature_78,svd_feature_9,svd_feature_85,svd_feature_45,svd_feature_57,

Most negative cofficeints
(contribute negatively to restaurants closure)

svd_feature_20,svd_feature_11,svd_feature_18,svd_feature_13,svd_feature_35,svd_feature_23,svd_feature_17,svd_feature_12,svd_feature_41,svd_feature_76,svd_feature_19,svd_feature_16,svd_feature_44,svd_feature_62,svd_feature_31,

LSA is able to increase the performance of logistic regression on the test-set with just 500 features. Overfitting is reduced as well. Next I will check the full model using only the 500 bag of words

In [380]:
# logistic regression
model = estimate(X_train_bow_lsa_t[:,1:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:], 
                 y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)] 
analyze_features(new_set_of_features, model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1470  1020
1           801  1717
              precision    recall  f1-score   support

           0       0.65      0.59      0.62      2490
           1       0.63      0.68      0.65      2518

   micro avg       0.64      0.64      0.64      5008
   macro avg       0.64      0.64      0.64      5008
weighted avg       0.64      0.64      0.64      5008

ROC-AUC 0.6361259174904542
PR-AUC-Score 0.5877138367175856
Test Accuracy:  0.6363817891373802
Train Accuracy:  0.6428000798881566
Most positive cofficeints
(contribute positively to restaurants closure)

age_of_restaurant,average_stars,svd_feature_27,svd_feature_0,svd_feature_28,nearest_average_rating,svd_feature_15,svd_feature_24,svd_feature_78,svd_feature_26,svd_feature_29,svd_feature_85,svd_feature_25,svd_feature_45,svd_feature_60,

Most negative cofficeints
(contribute negatively to restaurants closure)

review_count,svd_feature_20,svd_feature_18,svd_feature_11,svd_feature_13,svd_feature_35,svd_feature_23,svd_feature_12,svd_feature_17,nearest_averge_number_of_reviews,svd_feature_41,svd_feature_4,svd_feature_31,svd_feature_76,svd_feature_44,

The full model didn't imporve by much. Creating custom ensembles might be a good idea.

In [381]:
# XGBoost
model = estimate(X_train_bow_lsa_t.tocsr()[:,6:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,6:], 
                 y_test_bow, model_type="XGBoost")

analyze_features(["svd_feature_"+ str(i) for i in range(500)], model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1124  1366
1          1086  1432
              precision    recall  f1-score   support

           0       0.51      0.45      0.48      2490
           1       0.51      0.57      0.54      2518

   micro avg       0.51      0.51      0.51      5008
   macro avg       0.51      0.51      0.51      5008
weighted avg       0.51      0.51      0.51      5008

ROC-AUC 0.5100554720869179
PR-AUC-Score 0.5079130854122871
Test Accuracy:  0.5103833865814696
Train Accuracy:  0.6978729778310365
svd_feature_27,svd_feature_20,svd_feature_18,svd_feature_390,svd_feature_343,svd_feature_190,svd_feature_124,svd_feature_134,svd_feature_345,svd_feature_408,svd_feature_25,svd_feature_478,svd_feature_82,svd_feature_331,svd_feature_86,

There is a huge improvement for XGBoost. Let me check the full model

In [382]:
# XGBoost
model = estimate(X_train_bow_lsa_t.tocsr()[:,1:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:], 
                 y_test_bow, model_type="XGBoost")
new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)] 

analyze_features(new_set_of_features, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1527   963
1           777  1741
              precision    recall  f1-score   support

           0       0.66      0.61      0.64      2490
           1       0.64      0.69      0.67      2518

   micro avg       0.65      0.65      0.65      5008
   macro avg       0.65      0.65      0.65      5008
weighted avg       0.65      0.65      0.65      5008

ROC-AUC 0.6523373876762012
PR-AUC-Score 0.6003312283100328
Test Accuracy:  0.652555910543131
Train Accuracy:  0.7064110245656081
age_of_restaurant,review_count,nearest_average_rating,average_stars,nearest_averge_number_of_reviews,svd_feature_27,svd_feature_18,svd_feature_156,svd_feature_20,svd_feature_421,svd_feature_101,svd_feature_169,svd_feature_60,svd_feature_138,svd_feature_79,

XGBoost imporved from 50% to 64% percent

In [ ]:
# # lightgbm
# model = estimate(X_train_bow_lsa_t[:, 1:], y_train_bow, X_test_bow_t[:,1:], 
#                  y_test_bow, model_type="lightgbm")
# new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)] 

# analyze_features(new_set_of_features, model_type="lightgbm", limit=20)

LightGBM is still overfitted but its test score is slightly higher than XBBoost

B. Only Dense Features

The dense features include surrounding performance, I called them distance features. They are nearest_average_num_of_reviews, nearest_rating. The others are average_stars, review_count and age of restaurant

In [324]:
# logistic regression
model = estimate(X_train_bow_lsa_t.tocsr()[:,1:6], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:6], 
                 y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

analyze_features(first_5, model.coef_, limit=5)
print("\n\nThe Coefficients\n")
print(first_5)
print(model.coef_.ravel())
 Confusion Matrix
Predicted     0     1
True                 
0          1396  1094
1           759  1759
              precision    recall  f1-score   support

           0       0.65      0.56      0.60      2490
           1       0.62      0.70      0.65      2518

   micro avg       0.63      0.63      0.63      5008
   macro avg       0.63      0.63      0.63      5008
weighted avg       0.63      0.63      0.63      5008

ROC-AUC 0.6296064320825797
PR-AUC-Score 0.5822568234243103
Test Accuracy:  0.6299920127795527
Train Accuracy:  0.6257239864190134
Most positive cofficeints
(contribute positively to restaurants closure)

age_of_restaurant,average_stars,nearest_average_rating,nearest_averge_number_of_reviews,review_count,

Most negative cofficeints
(contribute negatively to restaurants closure)

review_count,nearest_averge_number_of_reviews,nearest_average_rating,average_stars,age_of_restaurant,

The Coefficients

['nearest_average_rating', 'nearest_averge_number_of_reviews', 'average_stars', 'review_count', 'age_of_restaurant']
[ 0.08608656 -0.06650183  0.1090887  -0.54641509  0.45769093]

The age of a restaurant contributes the most to restaurant closure. It is interesting to see that the nereast average number of reviews contribute to closure while more reviews helps a restaurant to survive

In [325]:
# XGBoost
model = estimate(X_train_bow_lsa_t.tocsr()[:,1:6], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:6], 
                 y_test_bow, model_type="XGBoost")

analyze_features(first_5, model_type="other", limit=5)
 Confusion Matrix
Predicted     0     1
True                 
0          1754   736
1           982  1536
              precision    recall  f1-score   support

           0       0.64      0.70      0.67      2490
           1       0.68      0.61      0.64      2518

   micro avg       0.66      0.66      0.66      5008
   macro avg       0.66      0.66      0.66      5008
weighted avg       0.66      0.66      0.66      5008

ROC-AUC 0.6572128067472432
PR-AUC-Score 0.6084859979662427
Test Accuracy:  0.6569488817891374
Train Accuracy:  0.6756041541841422
age_of_restaurant,review_count,nearest_average_rating,average_stars,nearest_averge_number_of_reviews,

The dense features are better predictors of restaurants closure. They achieve around 63% accuracy on the testset. After applying LSA the bag of words model shown improvement. Logistic regression improved from 53% to 64% accuracy. When the two are combined the accuracy didn't improve by much even after the improvement of the bag of words features.

Next Task: Tune Hyperparameters separately.and combine them

6: Only TF-IDF Features + LSA

In [326]:
# logistic regression with tfidf features
model = estimate(X_train_tf_idf_t[:, 6:], y_train_tf_idf, X_test_tf_idf_t[:,6:], 
                 y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.1)

analyze_features(all_the_features_tfidf[5:], model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1868   639
1           687  1814
              precision    recall  f1-score   support

           0       0.73      0.75      0.74      2507
           1       0.74      0.73      0.73      2501

   micro avg       0.74      0.74      0.74      5008
   macro avg       0.74      0.74      0.74      5008
weighted avg       0.74      0.74      0.74      5008

ROC-AUC 0.7352117788704223
PR-AUC-Score 0.6735490864588874
Test Accuracy:  0.735223642172524
Train Accuracy:  0.7441082484521669
Most positive cofficeints
(contribute positively to restaurants closure)

order,service,location,amazing,fast,staff,customer,recommend,delicious,busy,rude,great,love,come,line,

Most negative cofficeints
(contribute negatively to restaurants closure)

close,owner,think,menu,place,groupon,business,vegas,restaurant,sign,chef,new,open,lunch,like,

It is amazing to see the model with only tfidf features returning meaningful words, when it is trained separately

In [342]:
# logistic regression full model
model = estimate(X_train_tf_idf_t[:, 1:], y_train_tf_idf, X_test_tf_idf_t[:,1:], 
                 y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.1)

analyze_features(all_the_features_tfidf, model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1891   616
1           664  1837
              precision    recall  f1-score   support

           0       0.74      0.75      0.75      2507
           1       0.75      0.73      0.74      2501

   micro avg       0.74      0.74      0.74      5008
   macro avg       0.74      0.74      0.74      5008
weighted avg       0.74      0.74      0.74      5008

ROC-AUC 0.7443970955694308
PR-AUC-Score 0.6826440701244968
Test Accuracy:  0.744408945686901
Train Accuracy:  0.7612842021170362
Most positive cofficeints
(contribute positively to restaurants closure)

location,fast,year,order,rude,service,busy,customer,quick,store,subway,drive,line,staff,usually,

Most negative cofficeints
(contribute negatively to restaurants closure)

close,owner,vegas,try,menu,new,open,think,place,flavor,groupon,business,chef,sign,restaurant,

tfidf features with the other features created the best model among all

In [346]:
# XGBoost with tfidf features
model = estimate(X_train_tf_idf_t.tocsr()[:,6:], y_train_tf_idf, X_test_tf_idf_t.tocsr()[:,6:], 
                 y_test_tf_idf, model_type="XGBoost")

analyze_features(all_the_features_tfidf, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1904   603
1           706  1795
              precision    recall  f1-score   support

           0       0.73      0.76      0.74      2507
           1       0.75      0.72      0.73      2501

   micro avg       0.74      0.74      0.74      5008
   macro avg       0.74      0.74      0.74      5008
weighted avg       0.74      0.74      0.74      5008

ROC-AUC 0.7385931945530523
PR-AUC-Score 0.6782115894046392
Test Accuracy:  0.7386182108626198
Train Accuracy:  0.7786598761733573
clear,thanksgiving,receive,fare,lobby,grit,burn,overprice,melted,oppose,current,pita,vary,round,almond,

Applying LSA

In [383]:
# trying LSA
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"),  CustomNormalizer(), LSA(dim=500))

X_train_tf_idf_lsa_t = pipeline.fit_transform(X_train_tf_idf)
X_test_tf_idf_lsa_t = pipeline.transform(X_test_tf_idf)
Percent variance captured by all components: 71.55836699325043
In [384]:
X_train_tf_idf_lsa_t.shape
Out[384]:
(20028, 506)
In [385]:
# logistic regression with lsa on tfidf features
model = estimate(X_train_tf_idf_lsa_t.tocsr()[:,6:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,6:], 
                 y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

analyze_features(["svd_feature_"+ str(i) for i in range(500)], model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1934   573
1          1002  1499
              precision    recall  f1-score   support

           0       0.66      0.77      0.71      2507
           1       0.72      0.60      0.66      2501

   micro avg       0.69      0.69      0.69      5008
   macro avg       0.69      0.69      0.68      5008
weighted avg       0.69      0.69      0.68      5008

ROC-AUC 0.6854001119934954
PR-AUC-Score 0.6336904048254013
Test Accuracy:  0.6855031948881789
Train Accuracy:  0.6878869582584382
Most positive cofficeints
(contribute positively to restaurants closure)

svd_feature_0,svd_feature_8,svd_feature_1,svd_feature_22,svd_feature_12,svd_feature_6,svd_feature_15,svd_feature_32,svd_feature_26,svd_feature_10,svd_feature_76,svd_feature_29,svd_feature_4,svd_feature_46,svd_feature_33,

Most negative cofficeints
(contribute negatively to restaurants closure)

svd_feature_3,svd_feature_17,svd_feature_20,svd_feature_13,svd_feature_7,svd_feature_43,svd_feature_21,svd_feature_44,svd_feature_25,svd_feature_53,svd_feature_2,svd_feature_28,svd_feature_55,svd_feature_11,svd_feature_19,
In [386]:
# logistic regression full model after lsa is applied to textual features
model = estimate(X_train_tf_idf_lsa_t[:,1:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,1:], 
                 y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)

new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)] 
analyze_features(new_set_of_features, model.coef_, limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1796   711
1           905  1596
              precision    recall  f1-score   support

           0       0.66      0.72      0.69      2507
           1       0.69      0.64      0.66      2501

   micro avg       0.68      0.68      0.68      5008
   macro avg       0.68      0.68      0.68      5008
weighted avg       0.68      0.68      0.68      5008

ROC-AUC 0.6772694193164378
PR-AUC-Score 0.6221842082620456
Test Accuracy:  0.6773162939297125
Train Accuracy:  0.6913321350109847
Most positive cofficeints
(contribute positively to restaurants closure)

review_count,svd_feature_1,svd_feature_8,svd_feature_6,svd_feature_32,svd_feature_10,svd_feature_15,svd_feature_26,svd_feature_22,svd_feature_4,svd_feature_12,svd_feature_46,svd_feature_5,svd_feature_76,svd_feature_45,

Most negative cofficeints
(contribute negatively to restaurants closure)

svd_feature_3,svd_feature_7,age_of_restaurant,svd_feature_20,svd_feature_17,svd_feature_13,svd_feature_43,svd_feature_44,svd_feature_53,svd_feature_16,svd_feature_11,svd_feature_55,svd_feature_2,svd_feature_73,svd_feature_37,
In [387]:
# XGBoost with lsa applied to tfidf features
model = estimate(X_train_tf_idf_lsa_t.tocsr()[:,6:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,6:], 
                 y_test_tf_idf, model_type="XGBoost")

analyze_features(["svd_feature_"+ str(i) for i in range(500)], model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1796   711
1           734  1767
              precision    recall  f1-score   support

           0       0.71      0.72      0.71      2507
           1       0.71      0.71      0.71      2501

   micro avg       0.71      0.71      0.71      5008
   macro avg       0.71      0.71      0.71      5008
weighted avg       0.71      0.71      0.71      5008

ROC-AUC 0.71145574478625
PR-AUC-Score 0.6503654280190468
Test Accuracy:  0.7114616613418531
Train Accuracy:  0.7673257439584582
svd_feature_3,svd_feature_15,svd_feature_8,svd_feature_43,svd_feature_13,svd_feature_44,svd_feature_20,svd_feature_32,svd_feature_0,svd_feature_17,svd_feature_22,svd_feature_6,svd_feature_26,svd_feature_7,svd_feature_55,
In [388]:
# XGBoost full model
model = estimate(X_train_tf_idf_lsa_t.tocsr()[:,1:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,1:], 
                 y_test_tf_idf, model_type="XGBoost")
new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)] 

analyze_features(new_set_of_features, model_type="other", limit=15)
 Confusion Matrix
Predicted     0     1
True                 
0          1959   548
1           688  1813
              precision    recall  f1-score   support

           0       0.74      0.78      0.76      2507
           1       0.77      0.72      0.75      2501

   micro avg       0.75      0.75      0.75      5008
   macro avg       0.75      0.75      0.75      5008
weighted avg       0.75      0.75      0.75      5008

ROC-AUC 0.7531610411280243
PR-AUC-Score 0.6940349546081164
Test Accuracy:  0.7531948881789138
Train Accuracy:  0.7881465947673257
age_of_restaurant,review_count,svd_feature_3,svd_feature_32,svd_feature_8,svd_feature_15,svd_feature_1,svd_feature_26,svd_feature_6,svd_feature_43,svd_feature_7,svd_feature_44,svd_feature_53,svd_feature_20,nearest_average_rating,

XGBoost with SVD accuracy is a little greater than Logistic Regression without SVD

Which words dominated the top components?

In [389]:
pipeline.named_steps['lsa'].svd.components_.shape
Out[389]:
(500, 3229)
In [394]:
# component 3
top_words = np.argsort(pipeline.named_steps['lsa'].svd.components_[3])[::-1]
for w in top_words[:15]:
    print(all_the_features_tfidf[w])
surprised
band
drawback
willing
beat
grateful
new
assorted
sense
melted
swiss
din
bare
resort
freshly
In [395]:
# component 32
top_words = np.argsort(pipeline.named_steps['lsa'].svd.components_[32])[::-1]
for w in top_words[:15]:
    print(all_the_features_tfidf[w])
tasting
starve
pretty
goat
bump
bruschetta
incorrect
iron
fine
pass
husband
pita
oppose
separate
fare

Health Inspection Data

Health inspection results seem highly predictive of restaurants closure. I tried to obtain data about the 11 meteropolitan areas that are used for this project. Most of those areas or cities don't have complete data. Southern nevada has complete data but only 1000 restaurants match the Yelp dataset. I used more than 30000 restaurants in this project. Thus integrating health inspection data is not possible. I can do a seprate project on those 1000 restaurants.

In [ ]:
# restaurant_inspection = pd.read_csv("../../../Data & Script/nevada_restaurant_inspection.csv").drop(["State", "Serial Number", "Employee ID", "Permit Number", "Inspection Time"], axis=1)
In [ ]:
# restaurant_inspection.head(3)
In [ ]:
# restaurant_inspection.columns

Technical Coaching Sessions

1

Technical Coaching APP [Today at 2:04 PM] [Open] Ticket :hash: 17246 (edited) Requested by @Tinsae G. Alemayehu There is latitude and longitude information. Is there anyway to split train and test stratified by geographical location? :rocket: API | Assignee: [Group] Data Science | Type: incident 47 replies

Technical Coaching APP [2 hours ago] @Tinsae, You've opened a Technical Coaching help ticket!

A technical coach will be with you when they're able. Review our schedule: http://bit.ly/tc-schedule

To make things move quicker, please include the following if appropriate:

• Checkpoint Link • GitHub link to the branch you're on

• Make sure code is up-to-date • What’s got you stuck? Be as specific as you can! • What have you tried so far?

Tiago [2 hours ago] Hey, Tinsae! Long time no see

Tinsae [2 hours ago] Hi Tiago. Glad you are here to help

Tiago [2 hours ago] so, let me see if I understand, you kinda want to split your points into, for example, north/south boundaries?

Tiago [2 hours ago] like having a line at the middle of them?

Tinsae [2 hours ago] I want both train and test sets to include businesses from all over the place

Tiago [2 hours ago] okay. Just trying to understand what you mean by "stratifying"

Tiago [2 hours ago] like, how do you want to separate the data?

Tinsae [2 hours ago] likethis.png

Tiago [2 hours ago] like an abstract art painting

Tinsae [2 hours ago] :slightly_smiling_face:

Tiago [2 hours ago] I can't really help you do that, I suck at art and did a degree in Mathematics, because that was as far as I could go

Tinsae [2 hours ago] The blue dots are on the trainset and the red dots are on the test set. I assumed 2D distance

Tiago [2 hours ago] but, if I understand correctly, you want to take random samplings space-wise. Is that it?

Tinsae [2 hours ago] Yes

Tiago [2 hours ago] that's a good doubt, and will have to research again how to do it

Tiago [2 hours ago] I like your questions, always things that are interesting

Tiago [2 hours ago] I'm trying to check if there is some already-made algorithm for it

Tinsae [2 hours ago] take your time if you have time

Tiago [2 hours ago] (if I were to do such a thing by bare hand, my first instinct is to index all the points using an R-Tree and then get one point from each of the clusters the R-Tree generated)

Tiago [2 hours ago] ((buuut I guess there are better ways to do that))

Tiago [2 hours ago] can you send me a subset of the data?

Tinsae [2 hours ago] you can use yelp businesses dataset

Tinsae [2 hours ago] let me see if I can find the data

Tiago [2 hours ago] oooooh, okay

Tiago [2 hours ago] no, no problems

Tiago [2 hours ago] and... you want to stratify your data space-wise to do what with it?

Tinsae [2 hours ago] finding nearest neighbors for each busienss

Tinsae [2 hours ago] and create aggreagated features like nearest average stars (edited)

Tinsae [2 hours ago] It was part of my supervised caspstone

Tinsae [2 hours ago] Rtrees look like KD trees. I am reading about Rtrees

Tiago [2 hours ago] please take a look into this: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

Tinsae [1 hour ago] Those are clustering algorithms. Do I put cluster labels on the data and choose data from each cluster proportionally

Tiago [1 hour ago] I was considering two ways of doing what you asked:

1) Just use geopandas, for each business, get all business within a certain distance of it, and create an aggregate core. The problem of it is: the distribution of stores is not uniform. 2) Use HDBSCAN or make clusters (I might like that algorithm a lot) setting min_size for each cluster

Tinsae [1 hour ago] I did excatly the same as the first method with geopy. The problem is, not finding neighbors within 1 mile for many businesses because the random split may select businesses that are far way apart. (edited)

Tinsae [1 hour ago] specially for the test-set because it is limited in size (edited)

Tinsae [1 hour ago] I am checking HDBSCAN

Tiago [1 hour ago] (just as a sidenote, geopandas' distance function should be faster than geopy's, because geopandas does indexing

Tiago [1 hour ago] Or should do scanning, at least)

Tinsae [45 minutes ago] is it vectorized

Tinsae [44 minutes ago] I forgot to tell you that I adopted geopys method to make it vectorized

Tiago [43 minutes ago] me likey! (seriously)

Tiago [43 minutes ago] how did you do that?

Tinsae [37 minutes ago] I copied the code and manipulated it using numpy. Not that difficulult

Tiago [32 minutes ago] It's just that, if I recall how numpy works, the vectorization is made one layer below, at the C (or C++, I always forget) code

Tiago [31 minutes ago] (I might be fantastically wrong, though)

Tinsae [26 minutes ago] I don't how it works. There is some C behind the scene.

2

Technical Coaching APP [Feb 21st at 10:05 PM] [Solved] Ticket :hash: 17173 (edited) Requested by @Ryan Hohenhausen In Unit 4.4.3 the latent semantic analysis coded example is: svd= TruncatedSVD(130) lsa = make_pipeline(svd, Normalizer(copy=False :rocket: API | Assignee: David Reed | Type: incident 10 replies

Technical Coaching APP [1 day ago] @Ryan Hohenhausen, You've opened a Technical Coaching help ticket!

A technical coach will be with you when they're able. Review our schedule: http://bit.ly/tc-schedule

To make things move quicker, please include the following if appropriate:

• Checkpoint Link • GitHub link to the branch you're on

• Make sure code is up-to-date • What’s got you stuck? Be as specific as you can! • What have you tried so far?

Tinsae [1 day ago] I want more experienced person to answer this questions as well.

Question 1

SVD works directly on the matrices as opposed to co-variance matrices. I guess standardizing data before PCA is related to its usage of the co-variance matrix.

Question 2

SVD accepts sparse matrices and it works without converting them to dense. Tf-idf vectorizer creates a sparse matrix. Standardizing before SVD will affect the sparsity of the matrix. The dense matrix will require a lot of memory and the kernel may die. (edited)

Ryan Hohenhausen [1 day ago] Thank you. That does make sense for why we don't standardize before SVD. I still am unsure why the example normalizes after SVD as opposed to doing nothing to the data. Would there be anything wrong with skipping the pipeline all together like this: svd= TruncatedSVD(130) X_train_lsa = svd.fit_transform(X_train_tfidf) (edited)

Tinsae [1 day ago] We standardize after to make them unit vectors. It will remove the effect of long paragraphs.

Tinsae [1 day ago] Let me give simple example with just word counts paragraph1 is 100 words paragraph2 is 10 words paragraph1 good:20 times, bad 30 times,.. paragraph2 good:2 times bad:3 times...

The vector (20,30) and (2,3) show similar distribution but the large document will be more influential if we don't normalize the vectors

To normalize, simply divide by the magnitude of the vector(aka L2 norm)

Tinsae [1 day ago] vec1 = np.array([20,30]) vec2 = np.array([2,3]) print(vec1/np.linalg.norm([20,30])) print(vec2/np.linalg.norm([2,3])) The above will give you [0.5547002 0.83205029] [0.5547002 0.83205029] (edited)

Ryan Hohenhausen [1 day ago] Okay that mostly clears it up.

Your example normalizes term frequency vectors. This problem normalizes SVD transformed tfidf vectors, but the same concept should still apply. Let me know if I'm wrong here.

Would there ever be a time when we would prefer not to normalize? Like in your example, could the absolute frequency of the terms potentially be more important than the relative frequencies? Or would this just be better handled by making paragraph length and/or vector length a feature on its own?

Tinsae [1 day ago] When you normalize, for example a short tweet with little information is treated the same as a long article on Wikipedia.

I cannot answer your question. You may try to engineer different set of features and see their performance.

Ryan Hohenhausen [1 day ago] Okay will do. Thank you! Your answers have been helpful.

David Reed (Agent) APP [1 day ago] I'm closing this one out.

3

[Solved] Ticket :hash: 17143 (edited) Requested by @Tinsae G. Alemayehu How much is big corpus to train a word2vect model. I have 5B user reviews. Is that good enough? The size of google news corpus is 100B. Does :rocket: API | Assignee: David Reed | Type: incident

68 replies

Technical Coaching APP [2 days ago] @Tinsae, You've opened a Technical Coaching help ticket!

A technical coach will be with you when they're able. Review our schedule: http://bit.ly/tc-schedule

To make things move quicker, please include the following if appropriate:

• Checkpoint Link • GitHub link to the branch you're on

• Make sure code is up-to-date • What’s got you stuck? Be as specific as you can! • What have you tried so far?

Tiago [2 days ago] Hey, Tinsae. You have 5 billion user reviews, like, actual reviews? Wow. Where did you get that?

Tiago [2 days ago] And let me take a look at the Google News Corpus.

Tinsae [2 days ago] Sorry I meant 5 million :slightly_smiling_face:

Tiago [2 days ago] But unfortunately I have no specific answer for you, I don't know what would be a good size.

Tinsae [2 days ago] you may check this https://github.com/3Top/word2vec-api GitHub 3Top/word2vec-api Simple web service providing a word embedding model - 3Top/word2vec-api

Tiago [2 days ago] thanks!

Tinsae [2 days ago] The size of google news corpus is 100B. Does that mean 100B words?

Tiago [2 days ago] I am trying to understand what is that 100B. Because the vocabulary size is 3 million

Tiago [2 days ago] (and anyway, where did you get the 5 million user reviews? )

Tiago [2 days ago] (not pointing fingers, just "oh, cool, I might do some cool stuff with that")

Tinsae [2 days ago] Yelp reviews from kaggle

Tiago [2 days ago] oh, okay

Tiago [2 days ago] another question: why are you using that word2vec-api?

Tinsae [2 days ago] I am not using it

Tiago [2 days ago] oooooooh, okay

Tinsae [2 days ago] The link is given in the course to check out pretrained wordnetvectors

Tiago [2 days ago] FOUND IT: about 100 billion words

Tiago [2 days ago] https://code.google.com/archive/p/word2vec/

Tiago [2 days ago] And so, I ask you, how many words do all of your reviews have?

Tinsae [2 days ago] I don't know the actual number. I created a bag of words with min_df=0.001 and obtained 3000 words out of 1M reviews.

Tiago [2 days ago] which function did you use for the bag of words?

Tinsae [2 days ago] CountVectorizer

Tiago [2 days ago] okay

Tiago [2 days ago] and, do you have a link to the raw data?

Tiago [2 days ago] or to the kaggle page

Tinsae [2 days ago] It will take more than 2 hours to count the words. I thought we could guess the word count.

https://www.kaggle.com/yelp-dataset/yelp-dataset/kernels kaggle.com Yelp Dataset A trove of reviews, businesses, users, tips, and check-in data!

Tiago [2 days ago] Are you doing the count single-threaded? Only one processor core?

Tinsae [2 days ago] njobs=-1

Tiago [2 days ago] where do you use that param njobs?

Tinsae [2 days ago] I went back to the code to find that I didn't use CountVectorizer in parallelize form. It doesn't have njobs kwarg. (edited)

Tinsae [2 days ago] But the virtual machine I used has 8 cores with 30GB ram

Tiago [2 days ago] another way to check if it uses parallelization under the core is to run the thing and monitor CPU Usage (with something like htop)

Tiago [2 days ago] (I am saying this, because a looot of times I found something was parallelized by default, got a beefy VM and saw only one processor being used)

Tinsae [2 days ago] The cpu utilization never crossed 50% (edited)

Tiago [2 days ago] what did you use to check CPU Usage?

Tinsae [2 days ago] GCP compute engine has a monitor page for every VM

Tiago [2 days ago] oh, okay

Tiago [2 days ago] well, anyway, if it wasn't bigger than 50%, definitely not parallel

Tinsae [2 days ago] monitoring.png

Tiago [2 days ago] Well, I see two ways forward here:

1) Just train the model and let's see what happens 2) I go to google colab and do some parallel code to find out how many words that corpus has

Tiago [2 days ago] I am wiling to do the second route because not many people here now. But I want to finish my avocado first.

Tinsae [2 days ago] :avocado: ok.

Tinsae [2 days ago] Can you download and upload 5GB data easily? It is large json file. (edited)

Tiago [2 days ago] my idea is to download from kaggle straight to google colab's notebook

Tinsae [2 days ago] k

Tiago [2 days ago] btw, if you have any direct links just laying around, now is the time to share

Tinsae [1 day ago] ```# import neccessary libraries from ftplib import FTP import requests

login to ftp server server = "##.##.###.##" username = "**@tinsaealemayehu.com" password = "**" ftp = FTP(server) ftp.login(user=username, passwd=password)

rvfile = open("yelp_academic_dataset_review.json", "wb") ftp.retrbinary('RETR yelp_academic_dataset_review.json', rvfile.write) rvfile.close()``` (edited)

Tinsae [1 day ago] You can use the above code. I sent you the server, username password through PM (edited)

Greg [1 day ago] I have trained doc2vec models on considerably less text, and those are just fancy word2vec models, so my 2cents are to go for it.

Tinsae [1 day ago] Thanks @Greg. Just for fun, I searched "2cents" using google news vectors and found a bunch of rock bands 'Sworn_Enemy', 0.6559710502624512), ('Hinder_Papa_Roach', 0.6507534980773926), ('KILL_HANNAH', 0.6474124193191528), ('Singer_Jacoby_Shaddix', 0.645458459854126)

Tiago [1 day ago] Thanks, Greg!

Tiago [1 day ago] But I think I'll do the useless coding now just for the sake of fun

Tiago [1 day ago] Aaaand the results are just in: 6685900 words

Tiago [1 day ago] so, almost 7 million words

Tiago [1 day ago] in total

Tiago [1 day ago] now, to count uniques

Tinsae [1 day ago] Nice!! waiting for the unique counts

Tiago [1 day ago] Aaand the count just used all the RAM!

Tinsae [1 day ago] You spent enough time in it. If there is a chance you could share your code. I would try it in my 56gb ram virtual machine. (edited)

Tiago [1 day ago] import pandas as pd import dask.dataframe as dd !wget -c "https://storage.googleapis.com/kaggle-datasets/10100/277695/yelp-dataset.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1551046405&Signature=p%2BBda9sOLceu8EceqHAhrOPKgraMDDfZ8%2FqiW1z8DygzgoJaUzBW5xL8MWtD5z55OVJWe61Zv9qpdVeHCXcZ0r4mpV7dezhwA2Mmus2OyJqAhhWoXxPSieNB36fHaFmrm6xu%2FOEp5R1TJ2s72vWKU7tnOJS9%2BFBYnvOuV6cPN4lrpVVTrixOjCLv8QQWndfGR7V1Wcz2MqLaqAVFWuN94WU6ycz%2BlcSvdgqPplNHmmzcT%2BBvepxcuTheWMPuHusGUPJ1FHHQ4BEv8RgrhtTvXtV6jxupyw69eMHWpdN0J4bWHACc2%2BGthxgK00Tplt%2FrINAFiPATMf7dpciyNKigAA%3D%3D" -O yelp.zip !unzip yelp.zip ddf = dd.read_json('yelp_academic_dataset_review.json', blocksize=2**28) textao = ddf.text.to_bag() size_all_itens = textao.count().compute() distinct_items = textao.distinct().compute() ```

Tiago [1 day ago] soo.... apparently, only 4388 words

Tiago [1 day ago] that seems too little

Tiago [1 day ago] aaaand my code is broken

Tiago [1 day ago] anyway, going to close the ticket because the issue is solved (I guess)

Tinsae [1 day ago] Thank you very much. @Tiago. We had a very productive session. I learned a lot of new things (edited)

Tiago [1 day ago] thanks so much! Glad to hear it helped

David Reed (Agent) APP [1 day ago] I'm closing.

4

Crystal [2:01 PM] For the U4 capstone, I have data I performed tfidf on and then SVD to reduce the number of features. When I do a random forest classifier on it and look at the feature importances, is there a way to map those feature importances back to the original terms? So far I have: importances = rfc.featureimportances indices = np.argsort(importances)[::-1] (this gives the indices of important features in the SVD feature space)

I know there is a svd.inverse_transform() and vectorizer.inverse_transform(), I'm just having trouble figuring out what I would apply those too.

Thanks! (edited)

Tinsae [10 hours ago] Hi Crystal. SVD transformed features don't have name, right? We just call them feature0, feature1, ... Random forest may give feature importances like this: feature3 feature1 feature4

What if you do argsort on each one to see which original features dominated?

Tinsae [10 hours ago] I think, the above will not solve the issue. Doing argsort on each one will give which observations dominated.

Tinsae [10 hours ago] Assume feature3 is important. It is a projection from component3. What about doing argsort on svd.components[3]. It shows which original feature dominated component3

Crystal [10 hours ago] svdindices = np.argsort(svd2.components) words = list(df_tfidf2.columns)

features = [] for index in svd_indices: feature = [] This range is how many words I want for i in range(5): for counter, value in enumerate(index): if value == i: feature.append(words[counter]) features.append(feature)

Crystal [10 hours ago] not elegant, but I think this might be doing it

Crystal [10 hours ago] I think svd.components_ tells the relative importance of each component, so then if I argsort those, it gives the ranking of the words in terms of importance

Crystal [10 hours ago] so then when I use the indices from the rfc feature_importances, it seems to be working: Feature ranking:

  1. feature 1 (0.0769) ['sirians', 'slump', 'emotion', 'television', 'bay']
  2. feature 3 (0.0593) ['civilization', 'mess', 'rain', 'los', 'stand']
  3. feature 2 (0.0557) ['curb', 'hear', 'orphalese', 'long', 'skipper']

Tinsae [10 hours ago] My previous mentor didn't want to comment on it. According to my references, it seems right. I assumed more variance is more importance.

I did similar thing using PCA

loading_scores = pd.Series(pca.components_[0], index=genes) sorted_loading_scores = loading_scores.abs().sort_values(ascending=False) sorted_loading_scores[0:10].index

Index(['gene49', 'gene20', 'gene42', 'gene70', 'gene67', 'gene63', 'gene100', 'gene98', 'gene44', 'gene73'], dtype='object')

Tinsae [10 hours ago] @mguner may comment on this (edited)

Crystal [10 hours ago] thanks

Mike Swirsky (Agent) APP [10 hours ago] Take a peek into Unit 6 Lesson 4 - advanced NLP. There are some convenience functions in section 5 for finding the top words associated with each topic AKA component of your topic model AKA latent semantic analysis.

Tinsae [10 hours ago] Never thought about this " is not the best technique because of its use of negative loadings."

Mike Swirsky (Agent) APP [9 hours ago] Negative loadings aren't always bad. It just means that if a word occurs in a document, it's unlikely that another topic applies to it. It can be meaningful, it's just not as obvious to interpret.

mguner [9 hours ago] Hi @Tinsae and @Crystal and everyone else. I don't really know an answer for your question but I thought you might find helpful the following link if you haven't see it : https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca Also I looked at a little bit the documentation in scikit-learn but didn't see just SVD. Did you used TruncatedSVD or something else ? Cross Validated Relationship between SVD and PCA. How to use SVD to perform PCA? Principal component analysis (PCA) is usually explained via an eigen-decomposition of the covariance matrix. However, it can also be performed via singular value decomposition (SVD) of the data mat...

Tinsae [9 hours ago] Great!!

Mike Swirsky (Agent) APP [9 hours ago] Yeah we are talking about using truncatedSVD as part of an LSA process.

mguner [9 hours ago] Hi @Mike Swirsky, thanks a lot for your response. And @Crystal, back to your original question, I feel like you cannot find a one to one relationship between the original features and the new features. But you can look at feature importance as I guess they look at the inner products of the original features column and the new features column. Especially in TruncatedSVD you reduce the dimensionality, so you lose information. Finally I think, inverse_transform just reverse the SVD but again it mashes the features during the procedure to return the original X. I hope, this helps. :